I thought it would be interesting to look back at the past year or so of advances in interactive analytics on Hadoop. So I went through a bunch of release notes, JIRAs, and press releases in an attempt to summarize them in one blog post.
There has been pent-up demand for in-cluster analytics since at least mid-2014, when Gartner reported that 53% of Hadoop users were running interactive SQL. The closest comparable recent statistic I could find was from Databricks's 2015 Spark Survey Report, which reported that 69% of Spark users were using Spark SQL, making it the most popular part of Spark's API.
In late 2014 and early 2015, the landscape for interactive analytics on Hadoop was very different from what it is today. Here's where things stood back then.
In 2014 and most of 2015, I talked to many Looker customers and prospective customers who were running a Hadoop cluster. Almost all of them used the cluster to collect and transform data, but then exported it into a "real" database, such as Redshift or Vertica (or, for smaller datasets, MySQL or PostgreSQL), for analysis.
They told me they would have preferred to do all of their analysis in-cluster, without the extra export step, but they couldn't, whether because of limits in what their Hadoop distribution or cloud provider supported, missing features or immaturity in SQL support, or a general lack of awareness of what was available.
But despite each vendor pouring duplicated effort into its own pet project, SQL on Hadoop has matured over the past year or so to the point where it is now a viable platform for in-cluster analytical querying. Let's take a look at some of the solutions that Looker supports.
Back in October 2014, Impala 2.0 was a major release and a big jump forward. Joins and aggregates could now spill to disk instead of crashing when a query ran out of memory. With improved subquery support and the addition of window functions, Impala also became much more useful for analysis.
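As a hypothetical illustration of the kind of analytical query that window-function support made practical (table and column names here are invented), consider ranking each day's revenue within its month:

```sql
-- Rank each day's revenue within its month using a window function;
-- before window functions, queries like this required awkward self-joins.
SELECT
  order_date,
  SUM(revenue) AS daily_revenue,
  RANK() OVER (PARTITION BY TRUNC(order_date, 'MM')
               ORDER BY SUM(revenue) DESC) AS rank_in_month
FROM orders
GROUP BY order_date;
```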
Later releases increased compatibility by adding math, string, date/time, and bit functions used by other dialects. This made porting existing code to Impala a bit easier, while the most recent release added the complex types STRUCT, ARRAY, and MAP that are already found in Hive.
Compatibility with enterprise and cloud storage was enhanced by certification with EMC's Isilon and increased compatibility with Amazon's S3, respectively.
Security was also enhanced with the ability to restrict access at the column level and to redact personal information from log files. And for data scientists, a new Python data analysis framework called Ibis was released in July 2015.
Also, if you believe Cloudera's benchmarks (and why wouldn't you?), speed and concurrency improved a bunch.
The Hive 0.14 release reduced queries to "human interactive" response times of approximately 5 seconds, allowing Hortonworks to focus on the next phase of the Stinger project. That phase accomplished three main goals: 1) allowing INSERTs, UPDATEs, and DELETEs; 2) achieving sub-second response times; and 3) supporting SQL:2011 standard query syntax.
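A minimal sketch of what the new DML looks like in practice; Hive's ACID support requires a bucketed ORC table marked as transactional (the table and column names here are invented):

```sql
-- ACID DML in Hive requires a bucketed ORC table with transactions enabled
CREATE TABLE users (id INT, email STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE users VALUES (1, 'a@example.com');
UPDATE users SET email = 'b@example.com' WHERE id = 1;
DELETE FROM users WHERE id = 1;
```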
The presentations and benchmarks look very promising, but it will take a while for the work to make its way through to all of the Hadoop distributions and hosted Hadoop providers. So keep an eye out for Hive updates in release notes.
With the 1.1 release, Hive gained a third execution engine in addition to MapReduce and Tez: it can now run on Spark. Good news for those of us who miss Shark.
In February we said goodbye to the original HiveServer, which was single-threaded and could only run as a single user. Good riddance.
Tez, the Hive execution engine underneath Stinger, had two main releases in 2015: one in January that focused on the Tez UI (which is difficult to install), and one in May that focused on performance and stability. The rest of the releases were bug fixes.
Spark chugged along with roughly one release per quarter, each one adding useful features and/or speed enhancements.
At the end of 2014, Spark beat Hadoop MapReduce in a 100TB sort bakeoff, showing that while Spark can work well in other clustered environments like YARN/Hadoop or with Mesos, it is very efficient as a standalone compute framework as well.
Spark SQL graduated from alpha in March 2015 with the 1.3 release. The same release introduced Spark packages, which provide access to many other data sources, from CSV to Avro to several NoSQL stores. The last time I checked, there were 186 packages listed on spark-packages.org.
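For example, the spark-csv package could expose a CSV file as a queryable table directly from Spark SQL in the 1.3-era syntax (the file path and options shown are hypothetical):

```sql
-- Register a CSV file as a table via the spark-csv package
CREATE TEMPORARY TABLE events
USING com.databricks.spark.csv
OPTIONS (path "events.csv", header "true", inferSchema "true");

SELECT COUNT(*) FROM events;
```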
Project Tungsten added a lot of low-level performance enhancements to Spark as it was rolled out across several releases in 2015. By using its own object memory layout (instead of Java's), Spark is able to use memory more efficiently, join datasets more quickly, and reduce garbage collection time. Tungsten also uses code generation, similar to Impala's use of LLVM, to speed up loops and common tasks, such as date calculations and manipulating strings.
The latest 1.6 release of Spark adds automated memory management, which means no longer having to guess at how much memory to reserve for each Spark task.
Similar to Impala, Spark improved compatibility by adding many date, time, string, and math functions—over 100 of them. Window functions were also added. And Spark added compatibility with pretty much any version of Hive's metastore, making it easier to read tables written by Hive.
After 2015, lack of commercial support is no longer an issue for Presto. In June, Teradata announced that it was diverting part of its development team and consulting army to supporting Presto. In the same announcement, it mentioned that Presto was being used on petabytes of data at Airbnb, Dropbox, Groupon, Netflix, and Facebook.
Use of Presto with existing clusters was made easier with the release of Presto as RPM packages and versions certified for Cloudera's and Hortonworks's distributions. And it is available as a standard application in Amazon's, Treasure Data's, and Qubole's services.
Presto has an even more active release cycle than Spark, so there have been dozens of releases over the past year. These releases brought a lot of new features to Presto, notably more control over memory management, improvements in reading ORC and Parquet files, speed enhancements for joins and string filtering, Unicode support, expanded support for additional types of joins, and awareness of cluster network topology when planning queries.
Like the other SQL-on-Hadoop technologies, Presto added multiple functions to improve compatibility with other dialects, including Teradata's, of course. Its Hive connector also gained INSERT, DELETE, and CREATE, with support for partitioned tables.
Out of all of the implementations I work with that are running analytics with SQL on Hadoop, I see the most complex SQL being used with Presto.
Also, its ability to combine data across sources enables an interesting use case: event data is stored cheaply in-cluster and joined with lookup tables stored in another database, like Redshift or PostgreSQL. I hope that Presto will support full joins across different data sources in the near future, which would open up even more interesting use cases.
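A sketch of that pattern in Presto's catalog-qualified syntax, assuming a PostgreSQL connector configured as a catalog named `postgresql` (all catalog, schema, and table names here are made up):

```sql
-- Join raw event data in the Hive catalog against a lookup table
-- living in PostgreSQL, all within a single Presto query
SELECT u.country, COUNT(*) AS events
FROM hive.web.events e
JOIN postgresql.public.users u
  ON e.user_id = u.id
GROUP BY u.country;
```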
ORC and Parquet became top-level Apache projects in April 2015, reflecting the importance of columnar file formats to the Hadoop ecosystem in general and SQL on Hadoop in particular. Support in these file formats for features like hasNull flags, min/max records, and Bloom filters, as well as advances in compression, has contributed a lot toward speeding up SQL on Hadoop to the point of making it truly interactive. Hundreds of petabytes of structured data are now stored in these formats.
In September 2015, Cloudera unveiled Kudu, a scalable columnar storage engine designed from the ground up for high-speed analytics, which they had been secretly working on for several years. Kudu gets around some of the limitations of HDFS's append-only nature. It's still in beta, but early benchmarks of Impala on Kudu look very promising.
And in November 2015 the Transaction Processing Performance Council (TPC) launched a benchmarking standard for measuring performance of decision support solutions on "SQL-based Big Data systems" called TPC-DS 2.0. Database sizes can range from 1TB to 100TB. My hope is that this will finally lead to meaningful benchmarks being published in 2016 comparing Impala, Hive, Spark SQL, Presto, and others.
A lot of progress has been made in interactive analytics on Hadoop over the past year or so. After reading through the various release notes and JIRAs, I noticed these general trends across all SQL on Hadoop technologies:
Looker is looking forward to helping enable interactive analytics on Hadoop and Spark clusters in 2016 and beyond.
Disclaimer: I used to work at Cloudera