How to overcome the restrictions of multi-pass SQL
May 14, 2020
There is often a misconception that multi-pass SQL queries result in faster performance. Multi-pass SQL by itself is not necessarily a good or a bad feature. It all depends on the analytics solution’s architecture and whether it requires multi-pass SQL to achieve certain results.
In legacy analytics tools, the common paradigm is to run multiple SQL queries (each in its own 'pass') and then join the results on common dimensions. Data warehouse tables are typically modeled at a specific level of dimensionality, and reports are produced by combining them.
These legacy analytics tools were designed for older, expensive databases, so most of the analysis and query processing was handled by the analytical tool itself. Some aging tools use a ROLAP (Relational Online Analytical Processing) engine that executes SQL directly against a relational database. This allows users to take advantage of an underlying data model and execute queries against large volumes of data. Some legacy tools use this architectural approach to generate multi-pass SQL queries that extract data from multiple tables at different levels of granularity.
A modern approach would be to build an abstraction layer on top of the relational data source that would allow users to embed their business logic/rules within the application layer.
In this blog, we look at why multi-pass SQL is not a prerequisite for a flexible, performant semantic layer.
Why do some business intelligence architectures require multi-pass SQL?
Consider a reporting scenario where you need multiple measures (e.g. Inventory and Sales) that do not exist in a single fact table. This means there are separate fact tables for inventory and sales data. In some BI tools, the semantic layer consists of individual objects mapped to columns in tables or views in a database. This forces execution of separate SQL passes for the sales and inventory measures into temporary tables, followed by a final pass that joins those temporary tables to generate the final result set. This is purely a function of how the data is modeled.
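To make this concrete, here is a sketch of the kind of SQL such a tool might generate. The table and column names (`fact_sales`, `fact_inventory`, `product_id`, and so on) are hypothetical, and the exact syntax varies by database:

```sql
-- Pass 1: aggregate the sales measure at the shared dimension level
CREATE TEMPORARY TABLE tmp_sales AS
SELECT product_id, SUM(sales_amount) AS total_sales
FROM fact_sales
GROUP BY product_id;

-- Pass 2: aggregate the inventory measure at the same level
CREATE TEMPORARY TABLE tmp_inventory AS
SELECT product_id, SUM(on_hand_qty) AS total_inventory
FROM fact_inventory
GROUP BY product_id;

-- Final pass: join the intermediate results on the common dimension
SELECT s.product_id, s.total_sales, i.total_inventory
FROM tmp_sales s
JOIN tmp_inventory i ON s.product_id = i.product_id;
```

Each pass is a round trip to the database, and the intermediate temporary tables exist only to stitch the two measures back together.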
The same logic applies even to measures in the same table that are calculated at different levels. Let's consider a scenario where measure A and measure B exist in the same table but are grouped by (or filtered by) different dimensions. In this scenario, the same BI tool executes separate passes against the same table, each with a different "group by" (or "where") clause. Here too, the tool's SQL engine requires a final pass to generate the result set.
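A sketch of that pattern, again with hypothetical table and column names (`orders`, `region`, `channel`):

```sql
-- Pass 1: measure A, filtered to one channel
CREATE TEMPORARY TABLE tmp_online AS
SELECT region, SUM(amount) AS online_sales
FROM orders
WHERE channel = 'online'
GROUP BY region;

-- Pass 2: measure B, same table, different filter
CREATE TEMPORARY TABLE tmp_retail AS
SELECT region, SUM(amount) AS retail_sales
FROM orders
WHERE channel = 'retail'
GROUP BY region;

-- Final pass: join the two passes on the shared dimension
SELECT a.region, a.online_sales, b.retail_sales
FROM tmp_online a
JOIN tmp_retail b ON a.region = b.region;
```

Note that a single-pass equivalent exists in plain SQL using conditional aggregation, e.g. `SUM(CASE WHEN channel = 'online' THEN amount END)`, which is the kind of rewrite a smarter SQL generator can produce automatically.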
The downside to multi-pass SQL and this kind of architecture is that as business requirements change, it forces modeling changes to the backend warehouse and remapping of objects to new tables. This requires a regression testing exercise to ensure that the new modeling changes have not adversely affected the SQL of the existing reports. Thus, as requirements change, there is a development effort not only on the metadata layer, but also on the warehouse layer. This adds a layer of complexity as projects become larger. A badly designed warehouse or metadata layer quickly adds to the technical debt that most organizations want to avoid.
Why Looker’s flexible semantic layer does not require multi-pass SQL
A modern solution like Looker approaches this issue in a very different way. Looker introduces an abstraction layer in the form of a modeling language called LookML. With LookML, you can design your own reporting model within Looker.
Looker’s revolutionary modeling layer frees you from having to think this way. Looker can generate multi-pass SQL, but it rarely has to. In LookML, you describe the relationships between tables and the calculations you would like to perform, and Looker figures out how to write the SQL for you, a task that can take multiple passes to complete in other tools. Looker’s patented symmetric aggregates let it compute correct results regardless of the join pattern, making most multi-pass SQL obsolete.
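The intuition behind symmetric aggregates can be sketched in a single pass of SQL. This is an illustrative simplification, not Looker's exact generated SQL, and the tables (`users`, `orders`, `order_items`) are hypothetical. The join to `order_items` fans out the order rows, so a plain `SUM(orders.amount)` would double-count; offsetting each amount by a unique hash of the row's primary key lets `DISTINCT` drop the duplicates while preserving the sum:

```sql
-- Assumes amounts are integers smaller than the 1000000 offset and that
-- hash() produces no collisions; real implementations use much larger,
-- database-specific hash functions and offsets.
SELECT
  users.state,
  SUM(DISTINCT hash(orders.id) * 1000000 + orders.amount)
    - SUM(DISTINCT hash(orders.id) * 1000000) AS total_amount
FROM users
JOIN orders      ON orders.user_id = users.id
JOIN order_items ON order_items.order_id = orders.id  -- fan-out happens here
GROUP BY users.state;
```

Because the duplicates are eliminated inside a single query, there is no need for separate passes into temporary tables.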
Where multi-step processing is genuinely needed, Looker handles it as transformation: many transformations that would otherwise be done by a data engineering team can be done by data analysts directly in LookML.
The advantage of this approach is that as your business requirements change, you can quickly define new dimensions, aggregations, calculations, and relationships in LookML without requesting any changes to the warehouse model. The flexibility of the LookML modeling language even lets you join the same view twice when a “self-join” is needed, something other tools typically achieve with separate passes of SQL and table aliases. In all of these cases, you simply define a Looker Explore. Looker ‘Explores’ are very big, very wide logical tables that expose both dimensions and aggregate calculations; they also define the joins between views and the appropriate relationship between those views. Pick the dimensionality and aggregations you like, and Looker makes the calculations for you, regardless of how the data is joined.
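A minimal LookML sketch of such an Explore, including a self-join, might look like the following. The view names and keys (`order_items`, `inventory_items`, `user_id`) are hypothetical:

```lookml
explore: order_items {
  join: inventory_items {
    sql_on: ${order_items.inventory_item_id} = ${inventory_items.id} ;;
    relationship: many_to_one
  }
  # Self-join: the same view joined a second time under a different name
  join: repeat_orders {
    from: order_items
    sql_on: ${order_items.user_id} = ${repeat_orders.user_id} ;;
    relationship: many_to_many
  }
}
```

The `from:` parameter aliases the view, so no second SQL pass is needed; Looker writes the aliased join into a single query.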
Thus, Looker’s architectural approach does not require multi-pass SQL. Instead, Looker developers define Explores that combine data from different database tables, and the “Merge Results” feature lets you create a query from one Explore, add queries from other Explores, and display the merged results in a single table. From there, you can examine the data, pivot fields, and create visualizations. All of this comes with no performance drawback, because contemporary columnar data warehouses can be relied on to be fast.
To top it all off, Looker’s business logic layer is separate from the underlying data and shared across all downstream content. This separation enables immediate changes to the data model and instant propagation of those updates downstream as business rules evolve over time. There is no need to stress about remapping objects to new tables.
Learn more about the Looker data platform.