What is BigQuery?
BigQuery is a data warehouse that leverages the massive scale of the Google Cloud architecture to distribute data across thousands of nodes, utilizing as many nodes as are needed to run any query performantly. Unlike other database dialects, where you buy or rent individual machines or space on machines, there is only one instance of BigQuery, comprised of thousands of nodes, shared by all users of the instance.
The enormous scale enables BigQuery to run even enormous, complex queries in a relatively short time. As a result, even as your datasets grow from gigabytes to petabytes, BigQuery will remain responsive.
Reasons to choose BigQuery
Fast - Queries on BigQuery are never slow, regardless of the size of the data or the complexity of the query, because BigQuery’s architecture distributes the workload evenly across as many nodes as necessary.
Scalable - BigQuery is built on Google’s Cloud Storage Platform, which is designed to scale seamlessly to petabyte-level queries with all the reprovisioning of the clusters handled by Google, with no tweaking required.
Minimal management - BigQuery is extremely easy to get started with and maintain because the entire BigQuery instance is managed for you. It's as simple as uploading data through the Google Cloud Web UI and running queries.
What use cases is BigQuery really great for?
Companies with large and growing data
Loading data into BigQuery is free, and storing data is quite inexpensive. This makes BigQuery attractive to companies that find their data volumes growing rapidly. And because there is effectively no limit to the amount of data that can be stored or processed within BigQuery, querying a petabyte of data on BigQuery is as simple as querying a megabyte.
Teams with limited DevOps resources
Because Google handles all of BigQuery’s hardware and query optimization, BigQuery requires basically no maintenance. You just load your data and start querying.
Organizations with spiky workloads
BigQuery’s pricing is on a pay-per-query basis. That means that you only pay for what you use. So if you need massive power one day and very little the next day, BigQuery is likely a great option.
Before you commit
BigQuery charges a small price for data storage and a separate price based on the data scanned while querying (which is generally the larger portion of the bill). This means that it is relatively cheap to store large datasets in BigQuery, even if they’re queried infrequently. The real costs are incurred by using that data.
BigQuery will provide an estimate as to how much data will be scanned before a query runs, but in some cases these estimates can be off, leading to unexpected charges. Cost-controls are the most reliable way to avoid surprises on your bill, since they will prevent you from inadvertently running queries that scan a lot of data or are very complex (and thus more costly).
One concern that prospective companies have about BigQuery’s pricing model is the unpredictability of billing cycles since the price of the service is dependent on usage.
For this reason, BigQuery allows accounts to opt into a pricing feature known as Cost Controls, which sets a predefined quota on the number of bytes processed each day with the service. These limits can be set at a project level or applied across all users.
When a quota is set in BigQuery, the total amount available for the day is partitioned into smaller increments that are refilled every 15 minutes.
For example, if I set a limit of 10 TBs per day, I receive 100 GBs to use every 15 minutes. This means that I start the day with 100GB, and if I don’t run any queries in 15 minutes, I will receive another 100 GB, bringing the available amount of data I can process in BigQuery up to 200 GB. If at some point during the day I exhaust the quota completely, I’ll need to wait 15 minutes to receive an additional 100 GBs to run my queries.
For larger accounts that don’t want to enforce quotas, but also require a predictable billing model, BigQuery offers Flat-Rate Pricing, which allocates a predefined number of seats which receive the ability to run unlimited queries for no additional charge. Additional seats can be added for a flat rate as well.
Flat-rate customers are still charged for storage costs of their data, the flat-rate only applies to costs associated with querying the data.
Flat-rate pricing really only makes sense for larger enterprise accounts that have a large number of users regularly making lots of large, expensive queries against the database.
Trying it for free
BigQuery is included in Google Cloud Platform’s Free Tier, that provides prospective customers with $300 to spend over a 12-month timeframe on any Google Cloud product. This is in addition to the 1 free TB per month of data processed and 10GB of free storage in BigQuery.
BigQuery comes with a large set of pre-loaded public datasets that new users of the service have immediate access to, and can use to test out the capabilities of the service. The types of data in these public datasets ranges from government data (San Francisco Fire Department Service Calls) to enterprise data (Github Site-wide User Activity Data) to sports data (Major League Baseball Data).
You can also see how BigQuery performs with Looker on top on the following posts:
BigQuery database architecture
Unlike other database offerings, which allow consumers to rent or purchase a private instance for themselves that consists of however many machines they choose, BigQuery is one massive instance supported by hundreds of thousands of machines.
This is important because no individual account owns or has access to any individual machine in BigQuery. Rather, the entire processing power of the instance is rented out for at seconds at a time whenever queries are run (a difference in functionality that is reflected in a difference in pricing models between BigQuery and other types of MPP databases). Because BigQuery has the entire instance available to devote to queries at any time, queries return quickly and consistently, regardless of how large or complex the query is.
The Google Cloud team wrote a great post that describes BigQuery’s architecture in more detail, explaining in depth how different pieces of the stack interact to create a seamless experience for end users. An important takeaway from the post is that much of the software in the BigQuery stack is the same technology that Google uses for a variety of other offerings, like Gmail, Youtube, and Search, which means Google as a company is invested in consistently improving the performance and functionality of the software within the BigQuery data stack.
Types of data
BigQuery supports CSV, JSON, Avro, and Cloud Datastore backups. BigQuery can also treat Google Sheets as a table. More detail on supported data formats in BigQuery can be found here.
An interesting feature of BigQuery is its support for nested records within tables, which are essentially pre-joined tables within BigQuery. Tables containing nested records can be ideal for conceptually hierarchical data (like orders and items or sessions and pageviews) and offers several interesting new ways to model data. You can read about how to construct nested records within a BigQuery table from Looker co-founder and CTO Lloyd Tabb here.
Maximum recommended size of data
BigQuery can handle and comfortably query petabytes of data in a single query, but the entire architecture of BigQuery is designed to be close to infinitely scalable. Most BigQuery projects are allocated 2,000 “slots” so while large table scans are its bread and butter, you can run intro resource constraints when running complex queries that involve large table JOINs.
Because there are no machines that you need to manage yourself with BigQuery and no schema design decisions to make, getting started with the BigQuery database service is as simple as creating an account with Google Cloud Platform, loading a table through the web UI, and running a query.
Process for new data
You can load data from several source formats, including CSV, JSON, Avro, and Google Cloud Datastore backup files.
For assistance building out ETL pipelines, you can utilize an ETL partner that integrates with BigQuery, such as Fivetran, Stitch and Matillion.
BigQuery requires very little maintenance, since Google manages most everything for you. Since you share a single massive instance, there is no need to optimize clusters for data size, manually redistribute data for query execution speed, or fine tune query plans to achieve the best possible results.
However, there are choices you can make to reduce query costs, most specifically by partitioning your tables in one of several ways.