Query exabytes of data in AWS with Looker’s native support of Amazon Redshift Spectrum
May 18, 2017
At the AWS Summit on Wednesday, April 19th, 2017, Amazon announced a revolutionary new Redshift feature called Spectrum. Spectrum significantly extends the functionality and ease of use of Redshift by allowing users to access exabytes of data stored in S3 without having to load it into Redshift first.
Based on the resounding cheers of the crowd at the Summit and early interest from Looker customers, it’s a feature that people are pretty excited about. Looker customers on Redshift can take advantage of the feature today to maximize the impact of both technologies.
To highlight the speed and power of Spectrum during his keynote, AWS CTO Werner Vogels compared the performance of a complex data warehouse query running across an exabyte of data (approx. 1 billion GB) in Hive on a 1,000-node cluster versus running the same query on Redshift Spectrum. The query would have taken 5 years to complete in Hive, and only 155 seconds with Spectrum, at a cost of a few hundred dollars.
Beyond the obvious convenience of not having to load data first, the ability to query S3 data directly from Redshift gives users access to an exceptional amount of data at an unprecedented rate. It should also lower costs while giving users more granular data than ever before. Prior to Spectrum, you were limited to the storage and compute resources that had been dedicated to your Redshift cluster. Now, Spectrum provides federated queries across all of your data stored in S3 and dynamically allocates the necessary hardware based on the requirements of the query, ensuring that query times stay low while data volume continues to grow. Perhaps most importantly, taking advantage of Spectrum is a seamless experience for end users; they do not even need to know whether the query they ran was executed against Redshift, S3, or both.
Other benefits include support for open, common file formats including CSV/TSV, Parquet, SequenceFile, and RCFile. Files can even be compressed using GZip or Snappy, with other file formats and compression methods in the works.
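To make this a little more concrete, here is a rough sketch of how a table over files in S3 might be declared. It is a small Python helper that only builds the SQL text and does not connect to anything. The schema, table, column, and bucket names are all hypothetical, it assumes an external schema called spectrum has already been created, and the exact DDL options should be checked against the Redshift documentation before use.

```python
# Map the file formats mentioned above to Redshift STORED AS keywords.
STORED_AS = {
    "csv": "TEXTFILE",
    "tsv": "TEXTFILE",
    "parquet": "PARQUET",
    "sequencefile": "SEQUENCEFILE",
    "rcfile": "RCFILE",
}
# Delimited text formats also need a field delimiter clause.
DELIMITERS = {"csv": ",", "tsv": "\\t"}


def external_table_ddl(schema, table, columns, file_format, s3_location):
    """Build a CREATE EXTERNAL TABLE statement (all names are illustrative)."""
    fmt = file_format.lower()
    if fmt not in STORED_AS:
        raise ValueError(f"unsupported format: {file_format}")
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    lines = [f"CREATE EXTERNAL TABLE {schema}.{table} (", f"  {column_sql}", ")"]
    if fmt in DELIMITERS:
        lines.append(
            f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{DELIMITERS[fmt]}'"
        )
    lines.append(f"STORED AS {STORED_AS[fmt]}")
    lines.append(f"LOCATION '{s3_location}';")
    return "\n".join(lines)


# Hypothetical example: Parquet event files under an example bucket.
ddl = external_table_ddl(
    "spectrum",
    "events",
    [("event_id", "BIGINT"), ("user_id", "BIGINT"), ("created_at", "TIMESTAMP")],
    "parquet",
    "s3://example-bucket/events/",
)
print(ddl)
```

Once a table like this exists, it can be queried with ordinary SQL alongside the tables stored in the cluster itself.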
What this means for Lookers
Spectrum will allow Looker users to dramatically increase the depth and breadth of the data that they are able to analyze. Extremely complex queries can now be run over vast amounts of data at unprecedented scale.
Looker’s native integration, combined with Redshift Spectrum, offers the possibility of a near-infinitely scalable data lake and data warehouse, exciting new modeling opportunities, and expanded insights for businesses. Data stored in Redshift and S3 can be modeled together, then queried simultaneously from Looker. For example, extremely large datasets, or datasets that are subject to extremely complex queries, can be stored in S3 and take advantage of the processing power of Spectrum. The new feature will also allow for hybrid data models, where newer data is stored in Redshift and historical data is stored in S3, or where dimension tables and summarized fact tables are stored in Redshift while the underlying raw data is stored in S3. A well-designed data storage architecture, combined with the power of a Looker model, will allow Looker users to move easily from aggregates stored in the original Redshift cluster, built from terabytes of raw data, down to the individual events behind those aggregates, which live in S3.
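As an illustration of the hybrid model where newer data lives in Redshift and historical data lives in S3, the sketch below builds a single query that reads from both. It is only one possible approach, not a description of how Looker generates SQL. The table names public.events_recent and spectrum.events_history, the column names, and the cutoff date are all hypothetical.

```python
from datetime import date


def hybrid_events_sql(cutoff, recent_table="public.events_recent",
                      history_table="spectrum.events_history"):
    """Return SQL that reads new rows from Redshift and old rows from S3.

    Both table names are hypothetical; `cutoff` is the first day kept in
    the Redshift table, and everything older is read from the S3 table.
    """
    day = cutoff.isoformat()
    return (
        "SELECT event_id, user_id, created_at\n"
        f"FROM {recent_table}\n"
        f"WHERE created_at >= '{day}'\n"
        "UNION ALL\n"
        "SELECT event_id, user_id, created_at\n"
        f"FROM {history_table}\n"
        f"WHERE created_at < '{day}'"
    )


sql = hybrid_events_sql(date(2017, 1, 1))
print(sql)
```

A query along these lines could, for instance, sit behind a derived table in a Looker model, so that end users see one continuous events table regardless of where each row is stored.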
With this functionality, users have access to more data, deeper insights, and blazing-fast performance from one data platform.
Pricing for data stored in and queried against your existing, relational Redshift cluster will not change. Queries against data that is stored in S3 will be charged on a per-query basis at a cost of $5 per terabyte of data scanned. Amazon is providing several recommendations on how the data in S3 should be stored to minimize per-query costs. These same recommendations also maximize query performance.
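Because the charge depends on how much data a query touches in S3, a quick back-of-the-envelope calculation shows why storage layout matters. The sketch below assumes decimal terabytes (10^12 bytes) and uses made-up numbers: a 2 TB dataset, and a hypothetical layout in which a query only needs to read a tenth of it. Actual savings depend entirely on your data and queries.

```python
PRICE_PER_TB_USD = 5.0     # per-terabyte price quoted above
BYTES_PER_TB = 10 ** 12    # assumption: decimal terabytes


def spectrum_query_cost(bytes_scanned):
    """Estimate the charge for one query from the bytes it reads in S3."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical dataset: 2 TB of raw files, read in full by the query.
full_scan = spectrum_query_cost(2 * 10 ** 12)

# Hypothetical improved layout: the same query reads only a tenth of the bytes.
pruned_scan = spectrum_query_cost(2 * 10 ** 11)

print(f"full scan: ${full_scan:.2f}, reduced scan: ${pruned_scan:.2f}")
```

The point of the comparison is simply that anything that reduces the bytes a query has to read reduces the bill in the same proportion.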
Spectrum gives customers an unparalleled ability to leverage their data. As companies collect more and more data, they need ways to store and process that data quickly and cost-effectively, and Spectrum is an elegant solution to that problem. Users will always ask to dig deeper, and databases like Redshift with Looker on top allow companies to store, process, and expose the sheer quantity of data required to enable that. It’s interesting to watch, and I look forward to seeing how Redshift continues to innovate in this rapidly evolving market.
Read more about Spectrum