Data in Beta — GeistHaus

jmh May 10, 2023

I can’t resist blogging about the performance results we’re seeing for the Hydroflow dataflow runtime. This is a Rust-based library that is part of the Hydro project (more background below!) You can think about Hydroflow as a single-node dataflow library, kind of like Spark or Pandas, written in Rust. Except that (a) Hydroflow is tuned […]

Show full content

Fast Hydroflow! This gorgeous photo is from the lensscaper blog, https://lensscaper.wordpress.com/2012/09/25/fast-flowing-water/

I can’t resist blogging about the performance results we’re seeing for the Hydroflow dataflow runtime. This is a Rust-based library that is part of the Hydro project (more background below!)

You can think about Hydroflow as a single-node dataflow library, kind of like Spark or Pandas, written in Rust. Except that (a) Hydroflow is tuned for extremely low-latency handling of data, (b) it has networking functionality so that many Hydroflow programs can work in concert as a distributed system, and (c) it has semantic properties that make it amenable to high-level distributed optimizations and correctness analysis. The intended use case for Hydroflow is to be a runtime kernel for high-performance distributed systems.

And perform it does! Using Hydroflow, we have recently been able to get some world-beating performance numbers for classic distributed systems challenge problems. These are rough numbers, not scholarly results, but provide strong evidence that Hydroflow is as fast or faster than handwritten code in languages like C++ and Scala, despite being a higher-level language that is is amenable to many kinds of distributed optimizations and program checks in future.

Meanwhile, on to the numbers!

Case Study #1: Compartmentalized Paxos

A few years back, the inimitable Michael Whittaker was doing his Ph.D. at Berkeley. A serious student of distributed protocols, he observed that the godparent of consensus—Multipaxos—could be scaled up with simple but elegant techniques he called “compartmentalization”. Using these simple techniques he was able to build an implementation of MultiPaxos whose throughput scaled better than many much more complex schemes in the literature. All this is documented in his VLDB 2021 paper on Compartmentalized Paxos. Michael implemented Compartmentalized Paxos by hand in Scala.

The Hydro team recently reimplemented Michael’s scheme in a high-level spatiotemporal logic language called Dedalus, which compiles down to Hydroflow, which in turn is a Rust library that is compiled by llvm. And here’s the results to the right. The curves to focus on are the ones on the far right, which are giving huge throughput at low latency. “CScalablePaxos” is Michael Whittaker’s Scala code; “DedalusCScalablePaxos” is our Dedalus implementation of Michael’s design. All this is running on GCP using 20 n2-standard-4 machines
with 4 vCPUs, 16 GB RAM, and 10 Gbps network bandwidth, with one machine per node. Bottom line: Hydroflow’s performance on Compartmentalized Paxos is a bit better than Michael Whittaker’s state-of-the-art (as of 2021) handwritten Scala code. (The other curves are experimental results that will hopefully get explained in an upcoming paper.)

Case Study #2: Anna KVS

Readers of this blog will be familiar with the Anna Key-Value Store database. Anna was designed to run coordination-free across cores and machines. Upon first publication, it was especially good at high-contention workloads that other systems would fail at–it was up to 700x faster than Masstree, up to 800x Intel’s “lock-free” TBB hash table. Mind you it didn’t provide serializable consistency, but it did provide many coordination-free consistency levels, including Causal Consistency and Read Committed isolation. For our purposes here though, the question is whether a Hydroflow implementation can compete with Anna’s native C++ implementation.

Unlike Compartmentalized Paxos, the original experimental harness for Anna is lost in the mists of time and git. As a result it’s hard to recreate the graphs from the original Anna paper as we did for compartmentalized Paxos. The best we can do is show the 5-year-old Anna graph next to a graph of a Hydroflow implementation on the same workload using similar machines. And here are the results! The first, colorful graph is from the original Anna paper; the second shows new results from Hydroflow.

Scaling Anna across cores, from Wu, et al. 2018.

Hydroflow implementation of Anna running across cores, 2023.

Note the sweet linear scaling in both cases — that’s thanks to Anna’s coordination-free design. Note also that the Hydroflow implementation’s throughput is in the same order of magnitude as C++ Anna, i.e. 2+ orders of magnitude faster than Masstree and TBB. Mind you the original numbers in the first chart are from Amazon m4.16xlarge instances (64 vCPU and 256GB RAM) in 2018, and the Hydroflow numbers are from GCP n2-standard-64 instances (64 vCPU, 256GB RAM) in 2023. So while I won’t claim that Hydroflow is “faster” than the C++ codebase, it’s very clearly in the ballpark and shows evidence of being about as fast as handwritten C++.

HT to Lucky Katahanas and Mingwei Samuel, the primary developers of Hydroflow, as well as David Chu, Chris Liu, Shadaj Laddad, Rithvik Panchapakesan who all contributed to the Compartmentalized Paxos implementation. (A modest fraction of Hydroflow code was also written by yours truly.)

Background on Hydro

The Hydro project is a multi-year effort at UC Berkeley and Sutter Hill Ventures. The Hydro Project is developing cloud-native programming models that allow anyone to develop scalable and resilient distributed applications that take full advantage of cloud elasticity.

At the time of writing, Hydroflow is maturing into a high-performance open source runtime for low-latency dataflow. Hydroflow is designed with two use cases in mind:

Expert developers can program Hydroflow directly to build components in a distributed system.
Higher levels of the Hydro stack will offer friendlier languages with more abstractions, and treat Hydroflow as a compiler target.

Hydroflow provides a DSL—a surface syntax—embedded in Rust, which compiles to high-efficiency machine code. As the lowest level of the Hydro stack, Hydroflow requires some knowledge of Rust to use. Hydroflow is well-documented online, and via its source code on github.

The Hydro Stack as envisioned in the CIDR 2021 paper.

We laid out our vision for Hydro in a paper at CIDR 2021 called “New Directions in Cloud Programming”. In that paper we proposed a compiler stack with multiple components, as shown to the right. At the highest level of input, we hope to support a multitude of distributed programming styles, much like LLVM provides a compiler stack for a multitude of sequential programming languages. Also like LLVM, Hydro envisions multiple layers of intermediate representation languages (IRs) that can serve as common ground for program checks and transformations, providing developers with various entry points to work deeper in the stack as they see fit. The top layer of Hydro we call Hydraulic: a system to lift low-level code from legacy interfaces into a higher-level declarative distributed IR we dub Hydrologic. The next layer down is an optimizing compiler we call Hydrolysis, which takes Hydrologic and compiles it to run on multiple instances of a single-threaded, asynchronous dataflow IR called Hydroflow.

Want to learn more? Check out our webpage at https://hydro.run.

http://databeta.wordpress.com/?p=941

Extensions

A History of Postgres

jmh Jan 9, 2019

The ACM began commissioning a series of reminiscence books on Turing Award winners. Thanks to hard work by editor Michael Brodie, the first one is Mike Stonebraker’s book, which just came out. I was asked to write the chapter on Postgres. I was one of the large and distinguished crew of grad students on the […]

Show full content

A Berkeley Database Research Group T-shirt circa 1992. — This is an image of a t-shirt from the Berkeley Database Research Group circa 1994. According to Paul Aoki, “The turtle was adopted as a mascot by the INGRES group in the ’70s because ‘it’s slow but it gets there.’ It was retained as the POSTGRES mascot for sentimental reasons.”

The ACM began commissioning a series of reminiscence books on Turing Award winners. Thanks to hard work by editor Michael Brodie, the first one is Mike Stonebraker’s book, which just came out. I was asked to write the chapter on Postgres. I was one of the large and distinguished crew of grad students on the Postgres project, so this was fun.

ACM in its wisdom decided that these books would be published in a relatively traditional fashion—i.e. you have to pay for them. The publisher, Morgan-Claypool, has this tip for students and ACM members:

Please note that the Bitly link goes to a landing page where Students, ACM Members, and Institutions who have access to the ACM Digital Library can follow a link and look to see if they can get the book through their ACM membership. Students who are ACM members can purchase for a very low rate…the retail bookstore is more for corporate customers.

(Update 12/1/2019: the bitly link is no longer live. I have replaced it with a link to the ACM Digital Library page for the book. I was able to download the full book for free with my ACM membership; hopefully this is true for others as well.)

I insisted that my article be available for free. So they’ve decided that each book will have a “sample chapter”, and for Mike’s it’s this Postgres chapter. The ACM-formatted version is available free at the link above, but I have a standalone PDF up on arXiv and latex source on github.

If you see any errors or you think I should say more about something you know about, please submit a PR on github! I’ll update the arXiv version periodically.

http://databeta.wordpress.com/?p=911

Extensions

An Overview of the CALM Theorem

jmh Jan 8, 2019

For folks who care about what’s possible in distributed computing: Peter Alvaro and I wrote an introduction to the CALM Theorem and subsequent work that is now up on arXiv. The CALM Theorem formally characterizes the class of programs that can achieve distributed consistency without the use of coordination. I spent a good fraction of […]

Show full content

Original Author: Nick Youngson - link to - http://www.nyphotographic.com/

Original Image: http://www.picpedia.org/highway-signs/c/calm.html

For folks who care about what’s possible in distributed computing: Peter Alvaro and I wrote an introduction to the CALM Theorem and subsequent work that is now up on arXiv. The CALM Theorem formally characterizes the class of programs that can achieve distributed consistency without the use of coordination.

I spent a good fraction of my academic life in the last decade working on a deeper understanding of how to program the cloud and other large-scale distributed systems. I was enormously lucky to collaborate with and learn from amazing friends over this period in the BOOM project, and see our work picked up and extended by new friends and colleagues.

Our research was motivated by simple questions, chief among them this:

Q: “What is the hardest thing about distributed systems?”
A: “Coordination and consistency.”

Protocols like Two-Phase Commit, Paxos and their myriad offspring are celebrated for being tricky, and as such form the backbone of academic classes on distributed computing. But trickiness is not a hallmark of good software design. In practice, coordination is the source of much of the complexity and inefficiency of distributed systems, and it is avoided when possible by good engineers.

So we moved to a more fundamental question:

Q: When can we correctly avoid coordination, and when are we absolutely required to use it?
A (circa 2010): Unknown.

Surprisingly, this computability question was one that the pioneers of distributed systems never answered, at least not in any sense of algorithms or program semantics. The discussion in the literature was cast in terms of “memory models” or “storage consistency” guarantees so low down the stack as to be irrelevant and unhelpful to most application designers.

In a keynote talk at PODS 2010, I proposed an answer to this open question. I conjectured—based on my team’s experience with streaming queries and declarative networking—that coordination was needed if and only if you had a computational task that could not be expressed with monotonic logic. I called this idea CALM: Consistency as Logical Monotonicity. Not long thereafter a formalization and proof of the CALM Theorem was provided by Ameloot, Neven and Van den Bussche over at Hasselt University in Belgium. Related work ensued across both sides of the Atlantic on additional theoretical results and practical uses of the idea for program analysis.

I sense that this body of work deserves more attention today, when distributed computing is becoming the norm rather than the exception. CALM provides a formal basis for a myriad of conversations over the last 15 years regarding what is possible to get correct with “eventual consistency”, “noSQL”, “commutativity”, “ACID 2.0”, “CRDTs” and other pragmatics. It provides the nuanced answer to screeds about “beating the CAP Theorem”. It also lays the groundwork for what we did later with the Bloom language: provide a programming model where the really hard issues of distributed programming are first-order concerns of the language and its syntax.

To bring these issues to a wider audience, I sat down with the inimitable Peter Alvaro to write up what we hope is an approachable but sufficiently meaty intro to the CALM Theorem, its implications, and the many open questions remaining. It took a while for this to get to the top of our stacks, but the paper is now up on arXiv.

We’re spinning up a new generation of work on cloud programming here at Berkeley’s RISELab that builds on these lessons. Watch this space!

http://databeta.wordpress.com/?p=914

Extensions

Serverless Computing: One Step FW, Two Steps Back

jmh Dec 13, 2018

Colleagues at Berkeley and I have a new paper on the state of serverless computing that will appear at CIDR '19. It celebrates the arrival of public-facing autoscaling cloud programming, but critiques the current serverless offerings for turning their back on the very hallmarks of what makes the cloud exciting: data-centric and distributed computing. We hope it will start a constructive discussion on how to expose the right programming APIs and runtimes for the cloud.

Show full content

escharian_stairs_fb tl;dr: Colleagues at Berkeley and I have a new paper on the state of serverless computing that will appear at CIDR ’19. It celebrates the arrival of public-facing autoscaling cloud programming, but critiques the current serverless offerings for thwarting the hallmarks of what makes the cloud exciting: data-centric and distributed computing. We hope the paper will start a constructive discussion on how to expose the right programming APIs and runtimes for the cloud.

I’ve been fascinated by the potential of cloud computing for a decade now, prior to starting the Berkeley Orders Of Magnitude (BOOM) project. The cloud is the machine of a dream: more data and computing power than anyone could ever need, available to everyone.

Since the beginning, I’ve felt that a new platform like the cloud needs new programming languages that suit its “physics”. Once that is achieved, unexpected innovation will follow from the creativity of a world of developers. In the case of the cloud, the physical reality is a deeply data-rich, massively distributed, heterogeneous computing environment with the ability to grow and shrink consumption on demand. We have never had a computer with this power or this programming complexity.

After 10 years of people writing cloud programs in legacy sequential languages like Java, the public cloud providers are finally proposing a programming model for the cloud. They are calling it Serverless Computing, or more descriptively “Functions as a Service” (FaaS). As an interface to the unprecedented potential of the cloud, FaaS today is a disappointment. Current FaaS offerings do provide a taste of the power of autoscaling, but they have fatal flaws when it comes to the basic physics of the cloud: they make it impossible to do serious distributed computing, and crazy expensive/slow to work with data at scale. This is not a roadmap for harnessing the creativity of the developer community.

I sat down with colleagues at Berkeley to write what we think is a tough but fair assessment of where Serverless Computing is today, and describe the changes we think are needed to provide a programming model that matches the cloud’s physics and unlocks its potential. The paper will appear at CIDR 2019 in January, but a preprint is available on arXiv. We’re looking forward to the conversation that ensues.

We also have constructive research ongoing in this domain … watch this space for more!

http://databeta.wordpress.com/?p=900

Extensions

The Crossroads of AI and Database Algorithms: Query Optimization

jmh Sep 20, 2018

tl;dr: We observed that Dynamic Programming is the common base of both database query optimization and reinforcement learning. Based on this, we designed a deep reinforcement learning algorithm for database query optimization we call DQ. We show that DQ is highly effective and more generally adaptable than any of the prior approaches in the database […]

Show full content

tl;dr: We observed that Dynamic Programming is the common base of both database query optimization and reinforcement learning. Based on this, we designed a deep reinforcement learning algorithm for database query optimization we call DQ. We show that DQ is highly effective and more generally adaptable than any of the prior approaches in the database literature. We feel this is a particularly good example of AI and Database research coming together: both because of the shared algorithmic kernel, and because of the pragmatic need to resort to effective data-driven heuristics in practice. A preprint of the paper–Learning to Optimize Join Queries With Deep Reinforcement Learning–is on arXiv; a more technical blog post is available as well.

The Database and AI communities have a long, long history of flirtation, cross-pollination, and occasional mutual dissing. The points of overlap go back to joint roots in querying (Information Retrieval and Database Queries), representations of information (Data Models and Knowledge Representation), logic-based programming (Prolog and Datalog, Production Systems and Active Databases) and applied statistics at scale (parallel query processing and machine learning). Since the beginning, the database community has prided itself on fostering both intellectual depth and the kind of applied relevance that is reflected by multi-billion-dollar industry generation. Since the rise of Web search, the AI community has steadily moved into the realm of practicality as well. We are in another heady cross-pollination phase today, as the fields borrow and steal from each other at will.

On this note, there was excitement last year about a paper by Kraska and colleagues showing that B-tree-like indexes could be “learned” by a neural net. The core of the paper was a re-observation (see Antoshenkov’s Ranked B-Trees or Section 8.4 of the New Jersey Data Reduction Report) that a B-tree is essentially a hierarchy of distributions. From there, the novelty in the Kraska paper was to learn those distributions with a neural net, rather than the traditional algorithms and heuristics for tree construction. Neural nets are a data-driven heuristic for problems that defy algorithmic solutions. For the 1-dimensional indexing case of that paper, the classical algorithms are both asymptotically optimal and practically effective. Which leads to the question: what database systems problems merit exploration with new AI techniques?

The answer is straightforward: Query Optimization. SQL query optimization is a classical and practical piece of artifically intelligent magic: it takes a high-level user request and automatically synthesizes custom software for executing it at scale, replacing the need for COBOL (or MapReduce) programmers. The core algorithm for query optimization searches the space of potential programs via Dynamic Programming, a classic algorithmic kernel for optimization. Meanwhile, heuristic but statistically-grounded methods are used to estimate costs and rewards of various choices.

Dynamic Programming is also an important algorithm for AI search strategies. In particular, it forms the kernel of Reinforcement Learning (RL), the approach used successfully in recent famous AI game-playing milestones (Go, Atari games, etc.)

In recent work, we have been exploiting this algorithmic connection, mapping the classical System R optimization algorithm into an RL framework. The big benefit here is in developing asymptotically efficient data-driven heuristics for searching the exponential plan space. By contrast, the System R query optimizer prunes the plan space with ad-hoc heuristics that ignore the state of the database (“left-deep” trees only, postpone cross-products); moreover, after that pruning the remaining space is still exponential. Optimizers for “big” queries need to prune more aggressively, and typically do so via other data-blind heuristics. (Neumann and Radke surveyed and extended this space recently in the context of the Hyper system used inside Tableau — a highly recommended read!) Our novelty is to emply Deep RL for our pruning mechanism, in the context of Bellman’s original Dynamic Programming approach, which reduces the problem from searching an exponential space to a polynomial space. Unlike previous polynomial-time heuristics (e.g. the KBZ approach that inspired me back in graduate school), our “assumptions” are based on data and feedback, the hallmarks of learning algorithms. As a result, we are able to regularly outperform the classical heuristics and show adaptivity across a wide range of scenarios, where prior heuristics only worked in their “sweet spots”–again, a hallmark of learning.

To my mind, this is a particularly well-suited meeting point of Database and AI research: substantive cross-pollination of algorithmic roots, and an application of learning to a problem that requires data-driven, adaptive heuristics.

This work was led by the inimitable Sanjay Krishnan and Zongheng Yang, with assistance from Ken Goldberg, Ion Stoica and myself.

Credit to Mehul Shah for articulating the connection between query optimization and AI when he taught CS186 at Berkeley.

http://databeta.wordpress.com/?p=893

Extensions

Significant Update to Anna

jmh Sep 7, 2018

Over at the RISElab blog, we have a post on the latest updates to the Anna KVS. Anna started out as the fastest KVS we are aware of (by orders of magnitude!), with the widest choice of consistency models. It achieved that in part by using beaucoup de resources to replicate entire databases in memory. […]

Show full content

In addition to being fast and lightweight, Anna now wins big on efficiency due to selective replication, multitiering and autoscaling. Over at the RISElab blog, we have a post on the latest updates to the Anna KVS. Anna started out as the fastest KVS we are aware of (by orders of magnitude!), with the widest choice of consistency models. It achieved that in part by using beaucoup de resources to replicate entire databases in memory.

Thanks to recent work in the lab, Anna is not only incredibly fast, it’s incredibly efficient and elastic too: an autoscaling, multi-tier, selectively-replicating cloud service. All that adaptivity means that Anna ramps down resource consumption for cold things, and ramps up consumption for hot things. You get all the multicore Anna performance you want, but you don’t pay for what you don’t need.

Just to throw out some numbers, we measured Anna providing 355x the performance of DynamoDB for the dollar. No, I don’t think that is because AWS is earning a 355x margin on DynamoDB! The issue is that Anna is now orders of magnitude more efficient than competing systems, in addition to being orders of magnitude faster.

Chenggang Wu and Vikram Sreekanti were the driving forces behind this most recent work. They have more to say in their blog post, in the research paper on arXiv, and in the code on github.

(PS: the Anna v0 paper at ICDE was honored with a Best of Conference selection for publication in longer form in IEEE TKDE. Congrats to Chenggang Wu on that!)

In addition to being fast and lightweight, Anna now wins big on efficiency due to selective replication, multitiering and autoscaling.

http://databeta.wordpress.com/?p=891

Extensions

Anna: A Crazy Fast, Super-Scalable, Flexibly Consistent KVS

jmh Mar 9, 2018

There’s fast and there’s fast. This post is about Anna*, a key/value database design from our team at Berkeley that’s got phenomenal speed and buttery smooth scaling, with an unprecedented range of consistency guarantees. Details are in our upcoming ICDE18 paper on Anna. Conventional wisdom (or at least Jeff Dean wisdom) says that you have […]

Show full content

ANHU-1024x768-Alan-Vernon There’s fast and there’s fast. This post is about Anna*, a key/value database design from our team at Berkeley that’s got phenomenal speed and buttery smooth scaling, with an unprecedented range of consistency guarantees. Details are in our upcoming ICDE18 paper on Anna.

Conventional wisdom (or at least Jeff Dean wisdom) says that you have to redesign your system every time you scale by 10x. As researchers, we asked the counter-cultural question:

What would it take to build a key-value store that would excel across many orders of magnitude of scale, from a single multicore box to the global cloud?

Turns out this kind of curiosity can lead to a system with pretty interesting practical implications.

The key design point for answering our question centered on an ongoing theme in my research group over the last many years: designing distributed systems that avoid coordination. We’ve developed fundamental theory (the CALM Theorem), language design (Bloom), program checkers (Blazes), and transactional protocols (HATs, Invariant Confluence). But until now we hadn’t demonstrated the kind of performance and scale these principles can achieve across both multicore and cloud environments. Consider that domino fallen.

Side-note: Here’s what you need to know about coordination. In every computer — from a multicore chip on your phone to a cloud data center — many threads of execution are running at once. Almost every piece of software you run wastes enormous time coordinating with other threads to “leave its swimlane” … usually to modify bits of shared data. If each thread would just “stay in its swimlane”, all the threads would run at full speed. Don’t let anybody tell you otherwise (including peddlers of so-called “lock-free” data structures): updating shared data requires coordination and is a #GoSlow button for your software.

Anna offers world-beating speed at many scales. The paper includes numbers showing it beating Redis by over 10x on a single AWS instance, and beating Cassandra by 10x across the globe on a standard interactive benchmark. To get down to the details, we also benchmarked Anna against stronger contenders on a single-node batch request benchmark, to really see how fast it could go at its core task of puts and gets. Here Anna beats the pants off the competition, including comparable “state of the art” performance-oriented KVS systems: it was up to 700x faster than Masstree, up to 800x Intel’s “lock-free” TBB hash table. In fairness, those systems provide linearizable consistency and Anna does not. But Anna was still up to 126x faster than a “hogwild”-style completely inconsistent C++ hashtable due to cache locality for private state, while providing quite attractive coordination-free consistency. And when you want to scale up to the cloud (which Anna does but those systems cannot), you can’t realistically maintain linearizability anyhow. More on Anna’s consistency models in a moment.

Anna gets its performance and its scalability from its fully coordination-free implementation of simple actors with private state, which only communicate† via background gossip. Essentially it’s a distributed system deployed across cores as well as nodes. Coordination-freeness keeps every actor in its swimlane doing useful work: in high contention workloads we see 90% of Anna’s cycles going toward serving requests. For the same workloads, systems like Masstree and Intel TBB get well below 10% of their cycles serving requests—that’s because they spend over 90% of their waiting on coordination. However, even for low-contention workloads those systems suffer from processor cache misses due to shared memory.

I like Anna’s speed, but what’s also interesting is the palette of degrees of consistency Anna can achieve at that speed. A couple years back, we published the HATs work showing that there is a rich space of distributed consistency and transactional isolation that can (in principle) be achieved coordination-free. This includes fairly rich things like causal consistency or Read Committed transactions. We get this rich consistency in Anna with a very clean codebase, by porting design patterns of monotone lattice composition from Bloom to C++. The state of each Anna actor is a monotone lattice composition. Anna is the first system to offer all these consistency levels, and the various choices differ in only a couple dozen lines of C++ each. And thank goodness—because simplicity is key to this kind of speed.

Anna is a prototype and we learned a ton doing it. I think the lessons of what we did apply well beyond key-value databases to any distributed system that manages internal state—basically everything. We’re now actively working on an extended system, codename Bedrock, based on Anna. Bedrock will provide a hands-off, cost-effective version of this design in the cloud, which we’ll be open-sourcing and supporting more aggressively. Watch this space!

Credits:

Chenggang Wu was the fearless leader and key developer on Anna; Jose Faleiro and I were involved in the design. Props to Chenggang!
Thanks to old friends Peter Bailis (of HATs fame) and Neil Conway (of Bloom lattice fame) for feedback on this post, which builds directly on their influential earlier work!

* A native of California, Anna’s Hummingbird is the fastest animal alive relative to its size—a record established in prior research from UC Berkeley.

† One thing we did not have to think about much in Anna was fast asynch messaging across cores and nodes—we got this from a lightweight usage of 0MQ, which is fabulous. Hats off to the 0MQ team!

http://databeta.wordpress.com/?p=879

Extensions

Metadata Megafail: Messing up Your Data Strategy in 3 Easy Steps

jmh Feb 27, 2017

Requirements for metadata have changed drastically in recent years. How should data-driven organizations respond? Let me offer some pointedly lousy ideas.

Show full content

Cross-posted from the Berkeley RISELab Blog; comments should go there…

A key aspect of the RISELab agenda is to aggressively harness data—lots of it, both historical and live. Of course bits in computers don’t provide value on their own. We need a broader context for data: where it came from, what it represents, and how it gets used. Traditionally, people called this metadata: the data about our data.

Requirements for metadata have changed drastically in recent years in response to technology trends. There’s an emerging groundswell to address these new requirements and explore new opportunities. This includes our work on the broader notion of data context in the Ground system.

How should data-driven organizations respond to these changing requirements? In the tradition of Berkeley advice like how to build a bad research center and how to give a bad talk, let me offer some pointedly lousy ideas for a future-facing metadata strategy. Hopefully by reading this and doing the opposite, you’ll be walking the path toward a healthy metadata future.

Without further ado, here are 3 easy steps to Megafail with Metadata.

3 Steps to Metadata Megafail
Step 1. Metadata First

Reject any data that doesn’t come well-prepared with metadata in advance.

There’s a famous old slogan in computing: “Garbage In, Garbage Out”, a.k.a. #GIGO. What a great slogan! Wikipedia dates it back to 1957, and like many ideas from back then, it’s time to bring it back.

How can you tell if you have garbage data coming in? It breaks the rules on data format, content or sourcing. Which means it’s critical to formalize those rules in advance of loading any data, a strategy I like to call #MetadataFirst.

What’s that you say? 99% of your data arrives without meaningful metadata? My goodness—stay vigilant! If you have data with any whiff of garbage, then by all means throw that data away.

Now you might say that it’s a lot cheaper to store data now than it was in 1957. So why not keep data around even if it’s dirty and poorly-described? And you might say that analysts armed with AI-assisted data wrangling and analytics techniques can extract valuable signals from noisy raw data. So why not “figure out” useful metadata over time

Sure, you might say those things. But you shouldn’t! Because really… can you guarantee that those “signals” you’re extracting are true? You can’t! There are no airtight guarantees to be had unless they were enforced at the time of data collection. Because if it was garbage coming in, well … you know the saying.

#GIGO #MetadataFirst!

Step 2. Lock it up, Lock it in.

Deploy a metadata system that is as inflexible and proprietary as possible.

As we know, metadata is a critical asset: the key to enabling people to discover relevant data, assess its content, and figure out how to make use of it. A critical asset like metadata should not be left lying around. It should be locked away in a safe place, where it is difficult to access. I call that place #MetaJail.

You’re probably wondering how you can make a great metajail—one that renders your metadata really inaccessible. I’m here to help.

To begin, a good metajail should be prescriptive: it should impose a complex model that all metadata must conform to. This ensures that a diverse range of organizations—from biolabs to banks to bureaucracies—can all have an equally difficult time coercing their metadata into the provided model.

A great metajail is not only prescriptive, it’s proprietary. Vendor-specific solutions ensure both that metadata is locked up, and that the larger organization is locked in to the vendor as well. Even better is to put your metajail in a proprietary smokestack. Choose a vendor whose metajail only works with their other components: prep, ingest, storage, analytics, charting, etc. This is like a MetaPenalColony!

I hope it goes without saying that you should be the warden of the metajail. Wherever possible, require manual approvals for authorization. Then people who need to use data will have to go through you to get access. This maximizes your control over metadata, and hence your control over data, and hence your control over people! All of which increases your job security and keeps change at bay.

#MetaJail for the #MegaFail!

Step 3. One Truth

Ensure that the metadata enforces a single interpretation of data.

Traditional metadata solutions are often characterized as the Single Source of Truth for an organization. This is another great slogan from the past. #SSOT!

I like everything about this slogan. “SS” gets things started on an aggressive note, emphasizing central control in a single source (#MetaJail!) and rejection of any unapproved sources (#GIGO!).

But this step is not just a rehash of the previous two; it adds the final T: “Truth”. Now there’s a word that we don’t hear enough of in modern computing. The terrific thing about managing the Truth is that we get to define it. Remember, organizations are trying to do more and more with data, across an increasing number of use cases. Resist this trend! It is mayhem in the making. Consider a humble log file from a webserver. The Marketing department uses it to optimize the website layout. The IT department wants to use it to understand and predict server downtimes. The Online Services department wants to use it to build a product recommender system. Each one of these groups wants to extract different features and schemas from the log file, resulting in different metadata.

As you may know, Chief Marketing Officers supposedly spend more on IT than CIOs. (Which means, one assumes, that Marketing should control the Truth for your organization.) Hence in our example, the metadata describing the web logs should be defined by the Marketing department, and the log files should be transformed to suit that use case. Other representations of that information should be rejected as metadata Falsehoods.

So mandate a #SSOT, and manage Truth. Ask yourself: are you #SSoTOrNot?

Better Advice

At the risk of stating the obvious … all of the above steps are terrible ideas. But sometimes it’s good to rule some things out before we decide what we should do.

Want to know more about our serious ideas for metadata management? Have a look at our initial paper on Ground from CIDR 2017. As a brief excerpt, here are some design requirements for evaluating a modern metadata or data context system. It should be:

Model-Agnostic. For broad adoption and usage, a data context system cannot impose opinions on metadata modeling.
Relativistic. It should be possible to record multiple interpretations of the same data, allowing users and applications to agree to disagree.
Politically Neutral. The system should not be controlled by a single vendor or interest group. Like Internet protocols, data context is a “narrow waist”. To succeed, it must interoperate with a wide range of services and systems from third parties, some of which have yet to be born.
Immutable. Data context must be immutable and versioned; updating stored context in place is tantamount to erasing history.
Scalable. It is a common misconception that metadata is small. In many settings, the data context is far larger than the data itself.

Ground’s design is based on collaborations with colleagues from the field including Awake Networks, Capital One, Cloudera, Dataguise, LinkedIn, SkyHigh Networks and Trifacta. Slides from the conference talk are available as well.

http://databeta.wordpress.com/?p=873

Extensions

Lineage-Driven Fault Injection at Netflix

jmh Feb 4, 2016

As mentioned in my previous post, Peter Alvaro turned in his PhD thesis a month back, and is now in full swing as a professor at UC Santa Cruz. In the midst of that nifty academic accomplishment, he succeeded in taking the last chapter of his thesis from our BOOM project out of the ivory tower and into […]

Show full content

As mentioned in my previous post, Peter Alvaro turned in his PhD thesis a month back, and is now in full swing as a professor at UC Santa Cruz. In the midst of that nifty academic accomplishment, he succeeded in taking the last chapter of his thesis from our BOOM project out of the ivory tower and into use at Netflix. Peter’s collaborators at Netflix recently blogged about the use of his ideas alongside their (very impressive) testing infrastructure, famously known as ChaosMonkey, a.k.a. the Netflix Simian Army. This also generated some press, vaguely inappropriate headline and all. Their work raised the flag on five potential failure scenarios in a single request interface. That’s nice. Even nicer is what would have happened without the ideas from Peter’s research:

Brute force exploration of this space would take 2^100 iterations (roughly 1 with 30 zeros following), whereas our approach was able to explore it in ~200 experiments.

It always feels good when your good ideas carve 28 zeroes of performance off the end of standard practice.

I encourage you to read the Netflix blog post, as well as Peter’s paper on the research prototype, Molly. Or you can watch his RICON 2015 keynote on the subject.

http://databeta.wordpress.com/?p=637

Extensions

The Professors Peter, A and B

jmh Feb 1, 2016

It’s been a while since I’ve taken the time to write a blog post here. If there’s one topic that deserves a catchup post in the last few months, it’s the end of an era for my former students Peter Alvaro and Peter Bailis—henceforth Professor Peter A of UC Santa Cruz, and Professor Peter B of Stanford. Each of them officially turned […]

Show full content

berkeleysun It’s been a while since I’ve taken the time to write a blog post here. If there’s one topic that deserves a catchup post in the last few months, it’s the end of an era for my former students Peter Alvaro and Peter Bailis—henceforth Professor Peter A of UC Santa Cruz, and Professor Peter B of Stanford. Each of them officially turned in their dissertation in December. Both spanned an impressive range of theory, practice and genuine applicability. There’s tons of good stuff in each document. Here’s a bit of an overview with references for those of you who might not be diving in to read them cover-to-cover.

Peter Alvaro was a pillar of my BOOM research project from its inception. His thesis is entitled Data-centric Programming for Distributed Systems, and it covers a beautiful arc of results:

It starts with his insightful exploration of commit and consensus protocols in a declarative language, and his collaboration on the BOOM Analytics work that build a ridiculously high-function HDFS clone in ridiculously few lines of code and hours of developer time
It also includes his foundational design of the Dedalus logic for distributed programming that has become a touchstone for the database theory community, in addition to our team
and his contributions to the Bloom language including the core semantics and many pragmatic features
It covers in depth his work on the Blazes system for analyzing eventual consistency at the level of program semantics both for Bloom and for dataflow languages like Storm, and automatically synthesizing coordination code where needed including a high-performance solution called sealing
and finally it presents his work on Lineage Driven Fault Injection (LDFI) and the Molly prototype, which extracted new benefits from declarative programming in large-scale testing, and was recently adapted for use at Netflix.

The thesis leaves out a bunch of additional work he did at Berkeley, including contributions to the much-cited MapReduce Online effort, and his work on distributed system testing with BloomUnit. But what I’ll remember most from his graduate years is the team-teaching we did on Programming the Cloud, where we used our work on Bloom to get undergraduates learning the fundamentals of distributed systems via live coding. This was without question the most creative and audacious teaching I’ve been involved with, and it worked surprisingly well thanks in large part to Peter’s hard work and more importantly his warm and thoughtful spirit. I’m excited to see Peter A teaching it again this coming quarter at UC Santa Cruz.

Peter Bailis’ thesis is called Coordination Avoidance in Distributed Databases, and it’s a timely tour de force of fertile ideas found in what many considered a picked-over wasteland—transaction processing. Peter’s thesis includes a range of big ideas married to practical observations, including:

An empirical level-set on the costs of coordination in modern distributed databases.
The notion of Invariant Confluence, which attacks the distributed database problem of consistency without coordination by taking Church-Rosser graphs and applying them to databases with invariants.
An analysis of Invariant Confluence in the wild, via mining Github repos with Ruby on Rails apps to determine how “real” programmers tradeoff application constraints and database constraints. Not only did Peter do the legwork here to understand what programmers do, he brought it home to force us all to ask the questions of why they do what they do, and how the push and pull of technical communities can lead to better outcomes.
A new and very sensible (if you’re into that kind of thing) weak isolation level for transactions called Read Atomic, with a range of efficient implementations for distributed systems via RAMP protocols.

Peter B’s thesis also leaves out a range of important work he did at Berkeley, including the popular PBS statistical-empirical explanation of why NoSQL stores seem to work, his bolt-on causal consistency work and analysis, the initial design of the Velox model-serving system with colleagues in the AMPLab, and his popular shaming of the SQL transaction world by exposing how few SQL systems provide ACID transactions by default (or at all). I remember with gratitude how Peter took on half the work of teaching graduate databases at Berkeley (the first offering in years!) while I was deeply involved in running Trifacta. And finally, it has been a bracing dose of research and academic politics having him join in the latest edition of Readings in Database Systems; he did it with grace and intelligence.

Without question the best part of teaching at Berkeley is the students you get to work with. Peter & Peter: it has been a great pleasure. I suspect that being colleagues could be even more fun. Good to have you both still in town!

http://databeta.wordpress.com/?p=672

Extensions