Data analysis with Spark SQL
Feb 18, 2016
We’re excited to announce Looker’s support for Spark SQL. Since Looker enables data exploration and transformation by querying data where it lives, analytics directly benefits from Spark’s powerful compute engine and querying power. For all the reasons Spark has become a natural choice for large-scale data processing, Looker can provide a complementary exploration and visualization layer, bringing analytics directly to the underlying Spark architecture.
Data analysis on Spark with Spark SQL
Spark has seen rapid adoption across the enterprise as a solution for data processing. Designed to handle data at the petabyte scale and to distribute processing across thousands of nodes, it is ideal for large-scale data processing and loading. Spark is also a unified stack of closely integrated components, including streaming and machine learning, which is very attractive: this unification makes maintaining the diverse stack often required for data pipelines both easier and less expensive.
For the same reasons Spark is great at processing data, it can also be used for analytics by taking advantage of the Spark stack’s package for working with structured data: Spark SQL. Users can harness Spark’s performance to analyze structured data in a familiar and efficient way using a SQL-like interface. It also demonstrates the clear advantages of Schema-on-Read versus Schema-on-Write: the ability to copy and query data in its native format, effectively performing ETL on the fly, has obvious advantages for analyzing data at large scale.
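As a sketch of this schema-on-read approach, Spark SQL can register a raw file as a queryable view without any prior load or transformation step. The file path and column names below are purely illustrative:

```sql
-- Register raw JSON as a temporary view; the schema is inferred
-- at read time (path and column names are illustrative)
CREATE TEMPORARY VIEW events
USING json
OPTIONS (path 'hdfs:///data/events.json');

-- Query the data in its native format, transforming on the fly
SELECT user_id, COUNT(*) AS event_count
FROM events
GROUP BY user_id;
```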
SQL has effectively proven itself as the natural language for data analysis: analysts favor it for writing productive queries, and it is now supported across the big data ecosystem. Being able to write SQL against a data source already optimized for performance, like Spark SQL, enables analysis at the most granular level, without a complex ETL process that moves pre-aggregated data elsewhere for exploration and visualization. The underlying power of Spark can be used in an intuitive way to transform data at query time, so that even row-level data can be examined.
Looker on Spark SQL
Since Looker operates entirely in-database, completing data transformation via a modeling layer that serves as an abstraction of SQL, it uses the native database dialect for all analytical workloads. By design, Looker’s architecture takes advantage of each database’s SQL dialect and its underlying performance. Looker’s support for Spark SQL puts this architecture to optimal use: valuable results and insights are returned directly from Spark’s powerful compute engine, with no intermediary step. Looker benefits from Spark’s processing speed, which can be up to 100x faster than traditional MapReduce jobs. The LookML modeling language enables analysts and data scientists to curate an experience for end users, who can start visualizing and exploring their data without writing a single line of Python, Scala, Java, or even SQL.
How it works
Looker is a lightweight application that can be installed on-premise or in the cloud. Once configured, Looker connects to Spark SQL’s Thrift server using a standard JDBC connection. The data is accessed where it lives, alleviating the need to summarize or warehouse it and providing analytics as real-time as the data source.
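The connection path can be sketched roughly as follows. This assumes a standard Spark distribution with SPARK_HOME set and the default Thrift port; the hostname is illustrative:

```
# Start Spark's HiveServer2-compatible Thrift server (the JDBC endpoint)
$SPARK_HOME/sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=10000

# Verify the endpoint with beeline before pointing Looker's
# JDBC connection (jdbc:hive2://...) at the same host and port
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000
```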
Using the JDBC connection, Looker writes SQL to Spark and returns tabular result sets and their visualizations to the browser through a business-user-friendly drag-and-drop interface. This functionality is made possible by the modeling layer, which is built using LookML, a language that functions as an abstraction of SQL.
Through LookML, join relationships between tables in the underlying database and tables derived in the modeling layer are established to create starting points for data exploration. Data analysts are able to describe the underlying data and desired data transformations in those tables using a combination of LookML functions and the HiveQL dialect, which is supported by Spark SQL. These data descriptions, referred to in Looker as dimensions and measures, effectively serve as query building blocks or SQL snippets.
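For illustration, a minimal LookML view describing a table in the Spark cluster might look like the following. The table and field names are hypothetical, and the sql: parameters use the HiveQL dialect supported by Spark SQL:

```
view: orders {
  sql_table_name: warehouse.orders ;;

  # A dimension is a reusable SQL snippet describing a column
  # or a transformation of one
  dimension: order_month {
    type: string
    sql: date_format(${TABLE}.created_at, 'yyyy-MM') ;;
  }

  # A measure is an aggregate built from the same building blocks
  measure: total_revenue {
    type: sum
    sql: ${TABLE}.amount ;;
  }
}
```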
These SQL snippets are accessible via the previously mentioned drag-and-drop interface, called Explore, to facilitate data analysis and visualization by the end user. When a user selects the elements required for a report, the corresponding snippets are combined at runtime into an optimized Spark SQL query that produces the desired result set, enabling even users without any knowledge of Spark to take advantage of its querying capability.
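As an illustration of this assembly step (table and field names are hypothetical), selecting a date dimension and a revenue measure in an Explore might generate a Spark SQL query along these lines:

```sql
-- Assembled at runtime from the selected dimension and
-- measure snippets (names are illustrative)
SELECT
  date_format(orders.created_at, 'yyyy-MM') AS order_month,
  SUM(orders.amount) AS total_revenue
FROM warehouse.orders AS orders
GROUP BY date_format(orders.created_at, 'yyyy-MM')
ORDER BY 2 DESC
LIMIT 500;
```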
Looker is web-based, as users have come to expect from a modern BI tool. This allows for functionality such as canceling a query when a user closes or stops a browser page, which can prevent the cluster from being accidentally overloaded.
Why Looker on Spark?
Because Looker can take advantage of Spark SQL’s querying capability directly, the underlying performance and features of the language can be used to explore and visualize structured data. Spark SQL is already optimized for analytical workloads, so there is no need to move the data elsewhere to provide data exploration and visualization capabilities. Additionally, Looker can employ Spark SQL’s ability to use Hive UDFs, enabling complex data transformation and analysis through the large library of open-source UDFs already created, as well as custom UDFs developed for specific use cases. Looker takes advantage of the features that have made Spark a successful data engine to bring visualizations and insights to the end users who need them.
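As a sketch of how a custom Hive UDF can be brought into Spark SQL for use in modeled queries, one might register it from a jar on the cluster. The jar path, class name, and function names here are all illustrative:

```sql
-- Register a custom Hive UDF from a jar available to the cluster
-- (jar path and class name are illustrative)
CREATE TEMPORARY FUNCTION geo_distance
AS 'com.example.udf.GeoDistance'
USING JAR 'hdfs:///libs/custom-udfs.jar';

-- Once registered, it can be used like any built-in function
SELECT store_id, geo_distance(lat, lon, 37.78, -122.42) AS miles_away
FROM stores;
```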
To try Looker directly against Spark, contact Looker for a free demo.