How top engineering organizations build their big data stacks
Dec 7, 2015
Today, technology companies are pushing the boundaries of using data to build products and optimize businesses. These companies have been built from the ground up with data at their core. Often, they rely on data—and sophisticated data infrastructures—for their products and revenue models to function.
Amazon, Netflix, Pandora, and Spotify are famous for their use of algorithms to determine customer preferences and serve up recommendations for songs, movies, and products. Uber and Lyft have built an ever-changing, at-scale solution to the traveling salesman problem that all CS101 students know and love. And Buzzfeed is in the process of disrupting 150 years of journalism on the back of social sharing, native advertising, and analytics.
This article examines how eleven different high-growth technology companies use data to power their businesses, drawing on what they themselves have written. It has become a recent trend for companies looking to attract top-tier engineering talent to publish behind-the-scenes blog posts on various technical topics, and posts on data infrastructure in particular are de rigueur. The eleven companies included in this article are as follows:
These posts are all “success-biased”—no one has chosen to write about their data infrastructure screw-ups—but they contain many useful insights from some of the smartest people commercializing this type of technology today. What follows are the most interesting trends that emerge when looking at the group as a whole.
Data infrastructures have two primary use cases
Each of these companies is building an infrastructure primarily to support either business analytics or delivery of data-enabled product features.
Delivery of data-enabled product features. Frequently, data infrastructure is used to power product features. Braintree uses its data infrastructure to deliver real-time fraud detection. Pinterest uses its data infrastructure to deliver analytics to its advertisers. Netflix and Spotify famously use their data infrastructures to power content recommendation algorithms.
Business analytics. Companies use data infrastructure to power business analytics. Seatgeek, Looker, and Asana use data for funnel analysis, A/B testing, marketing optimization and more.
These two uses for data heavily drove technology choices made throughout the pipeline. In general, companies like Spotify, Netflix, Metamarkets, and Pinterest that were heavily focused on using data to deliver product features had very specific and technical requirements for their data infrastructures. This pulled them towards technologies like Spark, Pig, and Hive. Companies that used their infrastructures to support business analytics primarily use SQL-based tools paired with columnar data stores.
Insights come from integrated data
Asana’s Justin Krause said it best:
“First party owned data is simply the only way to achieve true business intelligence – if we can’t join data from different sources, we can’t answer questions like:
- Is our marketing campaign delivering quality users?
Requires joining ad attribution data onto engagement data
- Are our customer success programs successfully driving revenue expansion?
Requires joining lists – probably in our CRM – with engagement and billing data
- Did our most important customer just hit a bad bug, and we need to reach out?
Requires joining error/bug logs onto customer data”
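The first join Krause describes can be sketched concretely. The following is a minimal illustration using SQLite; the table and column names are hypothetical, not taken from Asana's actual warehouse:

```python
import sqlite3

# Hypothetical schemas: an ad attribution table and a product
# engagement table, joined on a shared user_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attribution (user_id INTEGER, campaign TEXT);
    CREATE TABLE engagement (user_id INTEGER, sessions INTEGER);
    INSERT INTO attribution VALUES (1, 'search'), (2, 'display');
    INSERT INTO engagement VALUES (1, 42), (2, 3);
""")

# "Is our marketing campaign delivering quality users?" --
# join attribution onto engagement and compare activity by campaign.
rows = conn.execute("""
    SELECT a.campaign, AVG(e.sessions) AS avg_sessions
    FROM attribution a
    JOIN engagement e ON e.user_id = a.user_id
    GROUP BY a.campaign
    ORDER BY a.campaign
""").fetchall()

for campaign, avg_sessions in rows:
    print(campaign, avg_sessions)
```

Neither table alone answers the question; the insight only exists once the two first-party datasets sit in the same queryable store.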
Most of these companies focused on two primary datasets: transactional data from production databases and user engagement data from event collectors. But some specifically highlighted pulling data from other sources as well. Many of these were marketing-focused: advertising, email, A/B testing, etc.
Companies are aware of three separate visualization needs
Sophisticated data consumers have begun to vocalize three separate needs within data visualization:
Dashboarding. These companies have very specific requirements for dashboards and frequently have decided to build their own tools. Asana, especially, was very clear about these requirements: smoothing, annotation, parameterization, and more.
Interactive analytics. Visualization tools that support quick, iterative, collaborative data discovery. Looker was the primary tool cited by these companies, showing up in 4 out of 11 posts. In every case, usage of Looker was paired with usage of Redshift.
Stream analytics. Used to monitor real-time data streams for anomalies and trends so that immediate action can be taken. This was the least-cited need; Asana specifically calls out Interana, and others have built custom solutions.
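To make the stream-analytics use case concrete, here is a toy sketch of our own—a rolling-mean deviation check—and not how Interana or any of these companies actually detect anomalies:

```python
from collections import deque

def detect_anomalies(stream, window=5, threshold=3.0):
    """Flag values that deviate from the rolling mean of the
    previous `window` values by more than `threshold` times that
    window's mean absolute deviation (a toy heuristic)."""
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(stream):
        if len(recent) == window:
            mean = sum(recent) / window
            mad = sum(abs(x - mean) for x in recent) / window
            if mad > 0 and abs(value - mean) > threshold * mad:
                anomalies.append((i, value))
        recent.append(value)
    return anomalies

# A steady stream of ~100 events/sec with one sudden spike.
events = [100, 102, 99, 101, 100, 98, 500, 101, 99, 100]
print(detect_anomalies(events))  # → [(6, 500)]
```

The value of this category of tooling is precisely that the check runs continuously against live data, so the spike triggers action while it is happening rather than in tomorrow's dashboard.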
The separation of visualization into three discrete categories hasn’t always been the case. It’s only in the recent past that visualization products have been built specifically to serve one of these use cases; previously, visualization products were more general-purpose. We expect this specialization trend to continue.
There is an overwhelming preference for build over buy in most of the stack
There are commercial software applications to handle every portion of the big data stack, but the profiled companies demonstrate a strong preference for deploying open source software paired with lots of custom code.
Most of the open source software used is maintained under the Apache Foundation, including Kafka, Spark, Hadoop, Pig, Hive, and more. And there is a significant preference for functional programming languages (Scala and Clojure) for data munging jobs and high-level languages (Python and R) for analysis.
There seem to be two areas where companies are willing to pay for commercial software. First, they’re more than happy to deploy infrastructure-as-a-service from cloud providers. Today, the cloud provider of choice is definitely Amazon, but the more important takeaway here is that IaaS is a deeply embedded infrastructure decision among innovative companies. Second, companies are ready to pay for commercial analytics applications. This shouldn’t be surprising: open source is a common choice for infrastructure, while software with a heavy UI/UX element is more frequently served by commercial products.
It remains to be seen whether this preference for build over buy in big data technology will become a common trend within the larger market. It may just be the result of these companies having gotten there first and finding that existing commercial products weren’t yet suited to their needs.
S3 and Redshift are dominant; Kafka usage is growing
There was heavy usage of AWS throughout these companies, most heavily focused on S3. 7 out of the 11 companies in this set used S3 as a part of their data infrastructure. Netflix provides an excellent rundown of its reasons for using S3 as its primary data warehouse:
Firstly, S3 is designed for 99.999999999% durability and 99.99% availability of objects over a given year, and can sustain concurrent loss of data in two facilities. Secondly, S3 provides bucket versioning, which we use to protect against inadvertent data loss (e.g. if a developer errantly deletes some data, we can easily recover it). Thirdly, S3 is elastic, and provides practically “unlimited” size. We grew our data warehouse organically from a few hundred terabytes to petabytes without having to provision any storage resources in advance.
Zulily is an outlier, using Google’s cloud platform throughout its big data stack. Of the companies we surveyed, Zulily was the only one to make heavy use of Google’s cloud offerings. We didn’t find any companies using Microsoft’s cloud platform, but that could change once its Azure SQL Data Warehouse reaches general availability later this year.
7 out of the 11 companies we profiled mentioned using an analytic data warehouse, and for those who did, Redshift was by far the dominant choice. The single company not using Redshift was Zulily, which uses Google’s BigQuery. We didn’t actually come across any discussion of why a particular columnar database was chosen. We can only presume that companies either found these technologies highly substitutable and went with the default option (typically AWS), or that there is wide acceptance that Redshift is the superior platform.
Kafka is the other technology whose usage we found noteworthy. A few years ago, Kafka was not widely deployed in data pipelines. The Netflix and Spotify writeups are good examples of this: their pipelines are heavily based on batch jobs. But as companies develop more diverse data needs and begin to focus on real-time data, Kafka has become a foundational technology. Braintree, Metamarkets, and Pinterest all use Kafka as a core part of their data infrastructures.
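What makes Kafka suit both worlds is its core abstraction: an append-only, replayable log from which each consumer reads at its own offset, so a real-time consumer and a nightly batch job can share the same stream. A toy in-memory sketch of that abstraction (not real Kafka, whose clients additionally handle partitioning, replication, and persistence):

```python
class Log:
    """Toy append-only log: producers append records, and each
    consumer tracks its own offset and can replay from anywhere."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def read(self, offset):
        # Return everything at or after the given offset.
        return self.records[offset:]

log = Log()
for event in ["click", "view", "purchase"]:
    log.append(event)

# A real-time consumer replays from the start; a batch job that
# checkpointed at offset 2 picks up only what it hasn't seen.
print(log.read(0))  # → ['click', 'view', 'purchase']
print(log.read(2))  # → ['purchase']
```

Because consumers are decoupled from producers this way, adding a new real-time use case doesn't require rewiring the batch pipeline that already feeds the warehouse.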
Data infrastructure has geek cred
Beyond the specific implementation details, it’s clear that companies have a pain point around data infrastructure. If they didn’t, they wouldn’t have collectively spent many thousands of hours working on and blogging about solutions to that problem. That, in and of itself, was interesting and valuable information for us.
But pursue this train of thought further and you realize something else: big data tech is hot with software engineers. Amazing people at top companies have spent many hours chronicling their efforts; either they are very interested in the topic or they know their potential recruits are. Quoting Michael Erasmus from Buffer:
“Building our new data architecture has been an amazing and fun adventure.”