I used to buy VHS tapes. It was great. I could buy my favorite movies and watch them whenever I wanted. Except most of the time I owned a movie, I wasn’t actually watching it. And when DVDs came out, I had to replace my whole collection to upgrade.
Then video rentals came along and I mostly stopped buying movies. Why pay to keep a movie on the shelf when I could access a vastly larger pool of movies whenever I wanted? Except, of course I was limited to one movie at a time, and sometimes the movie I wanted wasn’t available.
That’s why Netflix’s streaming model was such a revelation. Suddenly, I had access to an enormous catalog of movies in an instant and they were never out of stock. And if I wanted to binge watch, I could, without having to drive to the store to trade discs.
Movies transformed from a physical thing I bought, to a shared service I leased a fixed part of, to a utility that scales seamlessly to meet my current needs.
Analytic databases are following the same path.
At first, your only option was to buy a database of a given size, with given performance limitations. At certain times it got used a lot (closing the quarter or doing some big transformation). On other days it might not get used much at all. But not using it didn’t save you any money, because you’d paid up front. And you were stuck with the featureset and power you initially bought.
In the last five years, analytic databases moved to the cloud. That was wonderful, because you no longer had to worry about maintaining them and you could scale them up in a matter of hours. But you still were leasing a fixed part of the cloud and you still paid even when you weren’t using that part.
But now we have offerings like Google BigQuery, Snowflake, and AWS Redshift with Spectrum. They’re Netflix. They scale instantly in response to your current needs.
We call this new breed of warehouses “Elastic Analytical Databases”. These databases can scale up and down as needed depending on your workload. They share three key features that contribute to their elasticity:
Elastic Storage and Compute are Separate
Elastic databases are built on cloud storage systems like Google Cloud Storage or Amazon S3, which means they’re built to handle exabytes of data. This also means that you never need to worry about running out of storage. Because storage is relatively cheap (as compared to compute power), this makes it attractive to store all your data where the elastic database can access it.
Elastic Computing Power Scales (Almost) Infinitely
Elastic Databases have access to vast computing resources, so they can pull as many machines as are necessary to execute your workload. This all happens in a single database architecture--every query runs against the entire cluster. This means that elastic databases can return query results in a relatively consistent amount of time, no matter how large or complex the query is. Some architectures allow you to specify the size of cluster for a given workload, so you can trade speed and size, but just for a long as you need it to complete your workload.
Usage-based Model, Rather than Ownership-based Model
Most of these databases use permissioning, rather than physical separation, to keep different customers’ data separate. This allows providers to smooth demand by sharing power across many customers. So instead of paying to own your own instance (whether you’re currently using it or not), you pay to get access and use the instance for the time you need it.
Cloud storage that has the capacity to scale seamlessly no matter how much data you put in it is a key component of elastic databases. Whereas disk usage is a long-time headache for database admins, Amazon S3 and Google Cloud Storage ensure that you never hit capacity thresholds.
Not having to worry about running out of space is freeing. Just like managing Netflix’s infrastructure isn’t your responsibility, unlimited storage means you no longer need to focus on manually scaling up your cluster size with increased data volume. It lets you focus on doing good analysis, rather than capacity planning.
Elastic database are built to handle many workloads at once, so they have to scale out. Scaling up won’t work. That makes them very powerful. The same way that Google returns results to any question in seconds by hitting hundreds of machines in parallel, queries against an elastic database are similarly performant.
At the 2016 Google Cloud conference, I watched as a Googler ran a query against a petabyte of data. That’s the equivalent of ~110 consecutive years worth of movies. Querying that almost unthinkable amount of data took about two minutes. Similarly, a 4TB query run against a table with 100 billion rows stored in BigQuery returns in under 30 seconds.
The number of individual machines and amount of network bandwidth needed to run a similar query against a non-elastic database is massive, and this doesn’t even include the technical skill and resources needed to manage such a system. And when you’re managing those resources yourself, annoying things like failing servers are your problem. With an elastic database, Google or Amazon or Snowflake handles those failures gracefully in the background.
With an elastic database, the resources needed to run an enormous query are trivial because the system designed to run these databases is much, much larger. These database providers can loan you these vast resources for seconds at a time because they’re only a fraction of a fraction of the entire collection of machines that make up an elastic cluster.
The result is that your queries are never slow, and the cluster never goes down.
Unlike other analytical databases, you share only one instance of BigQuery or Athena with thousands of other customers. You dump your data into BigQuery or Athena, both of which are managed for you, and just start querying. No more spinning up or sizing your own instance. With Snowflake, you simply ask for as many machines as you need to perform your function, but only for as much time as you need it, essentially forming an on-demand army of machines that you can use for only the time you need.
With Netflix, you’re not paying for the movie itself. Rather, you’re paying to be a part of the Netflix network and use their services. With elastic databases such as BigQuery, Snowflake, and Athena, you’re not paying to own the database instance itself. Instead, you’re paying to be a part of the instance, and leverage their massive compute power for seconds at a time.
Today, if you have data in Salesforce, Zendesk or any other SaaS tool, you have to use their API to move that tool’s data into a database in order to query it. Some APIs are better than others, but by the time the data arrives it is out of date. How out of date the data is depends on the mechanism you are using but that delay can range from minutes to days.
As elastic databases take hold, I believe more and more of these SaaS tools will move to make their data available directly where these new databases can access them. Google’s recent introduction of BigQuery Data Transfer Service does exactly this. Just like sharing a Dropbox or Google Drive folder instead of downloading files to a USB drive, we’re moving to a world where data will be shared, rather than moved.
The data will therefore always be fresh and up to date. This instantaneous sharing stays constant no matter how big the data gets because it’s never moving.
We still don’t know how the data community will adapt to the new paradigm of elasticity. Will the benefits be enough to shift the data community towards these elastic databases? Or will unforeseen obstacles prevent them from taking hold?
At Looker, our unique architecture lets our customers immediately take advantage of these new developments in database technology to see if they’re right for them. It may be worth learning more about elastic databases to see if they’re the right fit for your organization; we think they’re pretty amazing.