Derived Tables

LookML
Version

On this Page
Docs Menu

Derived tables are important tools in Looker that enable you to expand the sophistication of your analyses. In some cases, they can also play a valuable role in enhancing query performance.

At a high level, Looker’s derived table functionality provides a way to create new tables that don’t already exist in your database. These tables can be temporary and built at query time, or they can be stored in your database. Either way, you define a derived table by writing a SQL query, and the results of that query become the derived table. You can then reference the derived table in a LookML view, just like any other table.

Two Types of Derived Tables

There are two types of derived tables:

Regular or “Ephemeral” Derived Table

A regular derived table - sometimes called an “ephemeral” derived table - is temporary and not written to your database. Its query is run every time a user requests data from it. Behind the scenes, Looker achieves this behavior by using a common table expression or a temporary table. If you’ve written complex SQL queries by hand before, the way you use an ephemeral derived table is similar to the way you would use a subquery.

Because an ephemeral derived table runs a new query on your database every time a user requests data from it, you’ll want to be sure that the SQL you write is performant and won’t put excessive strain on your database. In cases where the query will take some time to run, a persistent derived table is often a better option.

Persistent Derived Table

A persistent derived table - sometimes called a “PDT” - is written into a scratch schema on your database and regenerated on a schedule of your choosing. When the user requests data from it the table has already been created (in most cases), reducing query time and database load.

In order to use this feature you do need to create a scratch schema on your database, and your database needs to support writes. Most companies set this up when they initially configure Looker (instructions on this page), and of course you can also do so after your initial setup. However, there are some read-only database configurations that don’t allow persistence to work (most commonly Postgres hot-swap slaves). In these cases you can fall back on ephemeral derived tables.

Simple Example

Let’s consider an example to clarify the concept. Suppose your database already contains a table called order, and you’d like to summarize some of that order data by customer. We can create a new derived table named customer_order_facts to do this. Your SQL transformation might look like this:

The LookML for this derived table would look like this:

- view: customer_order_facts derived_table: sql: | SELECT customer_id, MIN(DATE(time)) AS first_order_date, SUM(amount) AS lifetime_amount FROM order GROUP BY customer_id   fields: - dimension: customer_id type: number primary_key: true sql: ${TABLE}.customer_id   - dimension_group: first_order type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date   - dimension: lifetime_amount type: number value_format: '0.00' sql: ${TABLE}.lifetime_amount
view: customer_order_facts { derived_table: { sql: SELECT customer_id, MIN(DATE(time)) AS first_order_date, SUM(amount) AS lifetime_amount FROM order GROUP BY customer_id ;; } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; } dimension: lifetime_amount { type: number value_format: "0.00" sql: ${TABLE}.lifetime_amount ;; } }

There are a few things to take note of:

  1. We’ve used the derived_table parameter to turn the view into a derived table.
  2. The column names of your result set will be the names you need to reference in your dimensions, such as ${TABLE}.first_order_date. For this reason, you should make sure to alias each column with a clean name by using AS. This is why we wrote MIN(DATE(time)) AS first_order_date instead of simply MIN(DATE(time)).
  3. Other than a few derived table parameters (derived_table and sql) at the top of the view, this view works just like any other.

Adding Persistence

As noted above, as long as you have created a scratch schema to enable persistence (instructions on this page), you can store derived tables in your database to increase query speed and reduce database load. Many users who can use persistence always use persistence, as typically the additional storage space required is not an issue.

Two Types of Persistence

sql_trigger_value

The primary method you’ll use is to specify a schedule on which the persistent derived table (PDT) will rebuild. This means that users will not have to wait for the PDT to be created. If the user happens to request data from the PDT while it is being rebuilt, data will be returned from the existing table until the new table is ready.

You achieve this kind of persistence by using the sql_trigger_value parameter.

persist_for

The other option is to set the length of time the derived table should be stored before it is dropped. If a user queries the table before that length of time has been reached, they’ll receive data from the existing table. Otherwise, it will be rebuilt the next time a user requests data from it, meaning that particular user will need to wait for the rebuild. You achieve this kind of persistence by using the persist_for parameter.

There is no parameter that means “make this derived table persistent”. Rather, the addition of either the sql_trigger_value or persist_for parameter automatically creates persistence.

Query Performance and Persistence

Because PDTs are actually stored in your database, you should specify indexes - using the indexes parameter - for SQL dialects that support them. Or, if you’re using Redshift, you can specify regular sort keys (using sortkeys), interleaved sort keys (using indexes), and distribution keys (using distribution).

If we wanted to add persistence to our original example we could set it to rebuild every day at midnight, and add indexes on both customer_id and first_order_date, like this:

- view: customer_order_facts derived_table: sql: | … sql_trigger_value: SELECT CURDATE() indexes: [customer_id, first_order_date]
view: customer_order_facts { derived_table: { sql: … ;; sql_trigger_value: SELECT CURDATE() ;; indexes: ["customer_id", "first_order_date"] } }

If you don’t add an index (or Redshift equivalent), Looker will warn you that you should do so as a means to improve query performance.

If you are going to use persistence, you should be mindful of the underlying query’s duration and start time, because it will have an impact on how PDTs are generated. These are the things to keep in mind:

  • If a PDT has not been generated, and a user requests data from it, generation will start immediately. If many users are active at the same time, many PDTs can be generated simultaneously (subject to your database connection limit).

  • During automatic generation that was not triggered by a user, sql_trigger_value tables only generate one at a time for any given connection. Consequently, if several PDTs are scheduled to regenerate at the same time, and they all use the same connection, only one will actually start generation at that time. The others will wait for their turn.

  • If you have two PDTs (we’ll call them PDT_1 and PDT_2), then make PDT_2 reliant on PDT_1, the PDT_1 table will need to finish generating before Looker can start generating PDT_2. Making one PDT reliant on another is discussed later on this page.

The takeaway is that long-running PDTs can slow or delay the creation of other PDTs. They can also slow down other queries for users while the database is working hard to generate the large PDT.

Persistence in Development Mode

When you add persistence to a derived table, there are some development mode behaviors you should know about.

Development Mode Behavior

If you query a PDT in development mode without making any changes to its SQL, Looker will query the production version of that PDT. However, as soon as you make any change to the SQL and query it, a new development version of the PDT is created. This enables you to test changes without disturbing end users.

If a development verison of a PDT is created, it will always be persisted for a maximum of 24 hours, regardless of the sql_trigger_value or persist_for you’ve used. This ensures that development mode tables are cleaned up on a frequent basis and don’t clutter your database.

After you push changes to production, Looker will immediately begin treating your development table as the production table (as long as you haven’t used the conditional SQL described below). This means that users will not have to wait for a new version of the production table to be built.

Table Sharing and Cleanup

Within any given Looker instance, Looker will share PDTs between users if they have the same SQL, and have the same sql_trigger_value or persist_for setting. Additionally, if the SQL that defines a PDT ever ceases to exist, Looker drops the table.

This has several benefits:

  • If you haven’t made any changes to a PDT in development mode, your queries will use the existing production tables
  • If two developers happen to make the same change to a PDT while in development mode, they will share the same development table
  • Once you push your changes from development mode to production mode, the old production SQL does not exist anymore, so the old production table is dropped
  • If you decide to throw away your development mode changes, that SQL does not exist anymore, so the unneeded development tables are dropped

Working Faster in Development Mode

Finally, there are cases when the PDT you’re creating may take a long time to generate, which can be frustrating when testing lots of changes in development mode. Looker supports a conditional WHERE clause for development mode that helps manage this:

- view: my_view derived_table: sql: | SELECT columns FROM my_table WHERE -- if prod -- date > '2000-01-01' -- if dev -- date > '2015-01-01'
view: my_view { derived_table: { sql: SELECT columns FROM my_table WHERE -- if prod -- date > '2000-01-01' -- if dev -- date > '2015-01-01' ;; } }

In this example, the query will include all data from 2000 onward when in production, but only the data from 2015 onward when in development mode. Using this feature in clever ways to limit your result set, and increase query speed, can make development mode changes much easier to validate.

If you use this feature, please keep in mind that tables will never be shared between production mode and development mode. This will use more storage on your database, and it means that the production mode table will need to rebuild after you commit your changes to production.

Users Can Override Persistence

End users can override the persistence settings you create if they need to get the most up-to-date data. By choosing Rebuild Derived Tables and Run from the gear drop down in the upper right of a query, users can force every PDT referenced in the query to rebuild:

This feature enables a user to query up-to-date data when they need it. While that specific user is waiting for the tables to rebuild, other users’ queries will still use the existing tables. Therefore, at least in that sense, other users won’t be negatively impacted. However, other users could be affected by the additional load on your database. If you are in a situation where triggering a PDT rebuild during business hours could put an unacceptable strain on your database, you may need to communicate to your users that they should never rebuild certain PDTs.

Referencing Derived Tables in Other Derived Tables

It is possible to reference a derived table in the SQL of another, which is sometimes called “cascading derived tables”. To use them, you’ll need to use the syntax ${derived_table_view_name.SQL_TABLE_NAME}. You can also use the ${derived_table_view_name.SQL_TABLE_NAME} syntax within the sql_trigger_value parameter.

The string SQL_TABLE_NAME is a literal string and should be written exactly; it’s not a variable that needs to be replaced by anything.

Although it’s not always required, it’s often useful to alias tables when you refer to them in this manner, for example:

${derived_table_view_name.SQL_TABLE_NAME} AS derived_table_view_name

This is because, behind the scenes, persistent derived tables are named with lengthy codes in your database. In some cases (especially ON clauses) it’s easy to forget that you need to use the ${derived_table_view_name.SQL_TABLE_NAME} syntax to retrieve this lenghty name. An alias can help to prevent this type of mistake.

In this example we’re creating a derived table from events that cleans out unwanted rows, and then creating a summary of that. The event_summary table regenerates whenever new rows are added to clean_events.

- view: clean_events derived_table: sql: | SELECT * FROM events WHERE type NOT IN ('test', 'staff') sql_trigger_value: SELECT CURDATE()   - view: event_summary derived_table: sql: | SELECT type, date, COUNT(*) AS num_events FROM ${clean_events.SQL_TABLE_NAME} AS clean_events GROUP BY type, date sql_trigger_value: SELECT MAX(id) FROM ${clean_events.SQL_TABLE_NAME}
view: clean_events { derived_table: { sql: SELECT * FROM events WHERE type NOT IN ('test', 'staff') ;; sql_trigger_value: SELECT CURDATE() ;; } }   view: event_summary { derived_table: { sql: SELECT type, date, COUNT(*) AS num_events FROM ${clean_events.SQL_TABLE_NAME} AS clean_events GROUP BY type, date ;; sql_trigger_value: SELECT MAX(id) FROM ${clean_events.SQL_TABLE_NAME} ;; } }

This particular example could be done more efficiently in a single derived table, but it’s useful for demonstrating derived table references.

Monitoring and Troubleshooting Persistent Derived Tables

Companies that use many PDTs - especially if they are cascading - often want to understand the status of those PDTs. Looker offers several tools to investigate these tables, which are described on this page.

When attempting to troubleshoot persistent derived tables, please keep the following items in mind:

  • When investigating the derived table log, pay special attention to development tables vs production tables as you do so. Confusion often occurs around the difference between development and production tables.
  • Validate that no changes have been made to the scratch schema where Looker stores persistent derived tables. If changes have been made you may need to update the Connection settings in the Admin section of Looker, and a restart of Looker may be required to restore normal PDT functionality.
  • Determine if there are problems with all PDTs, or just one. If there is a problem with one, then the issue is likely due to a LookML or SQL error.
  • Determine if problems with the PDT correspond with the times when it is scheduled to rebuild.
  • Make sure that all sql_trigger_value queries evaluate successfully, and return only one row and column, by running them in the SQL Runner (applying a LIMIT protects from runaway queries).
  • Verify that the SQL of the PDT runs without error by running it in the SQL Runner (applying a LIMIT keeps query times reasonable).
  • Check that any tables on which the problem PDT depends - whether normal tables or PDTs themselves - exist and can be queried.
  • Use the Show Processes button in the SQL Runner. If there are a large number of processes active, this could slow down query times.
  • Monitor comments in the query. Database administrators can easily differentiate normal queries from those that generate PDTs. Looker adds a comment to the CREATE TABLE ... AS SELECT ... statement that includes the PDT’s LookML model and view, plus a unique identifier (slug) for the Looker instance. If the PDT is being generated on behalf of a user in development mode the comment will indicate the user’s ID. This comment can be seen in a database console, as shown in the following example of a PDT being generated from development mode:

    The PDT generation comment can also be seen in an explore’s SQL tab if Looker had to generate a PDT for the explore’s query. The comment appears at the top of the SQL, as shown in the following example:

    Finally, the PDT generation comment appears in the details of the Queries admin page, as shown in the following example:

Still have questions?
Go to Discourse - or - Email Support
Top