Looker Blog : Data Matters

Looker and Vertica : ORC Reader

Erin Franz, Data Analyst

Oct 30, 2015

Vertica’s HDFS connector provides a seamless experience when accessing data both in HP Vertica and data housed in HDFS by allowing data access to both via Vertica’s querying interface. This feature set means users can take advantage of both Vertica’s performance and analytical functions across both native Vertica and Hadoop environments from one central location. And since Looker queries Vertica directly, data on HDFS can be explored and visualized in the same way.

Starting with Vertica 7.1 SP2, HP Vertica has improved its HDFS connector by providing enhanced ORC file processing capability, including column pruning and predicate pushdown. This addition allows querying data in files on HDFS or locally to be similar to HP Vertica’s own native columnar format, providing significant performance gains over regular text files. In this tutorial, we’ll show how to create ORC files from text files, use the Vertica ORC reader to create queryable tables from those files in HP Vertica, and finally explore and visualize the data using Looker.

ORC File Creation

Optimized Row Columnar (ORC) format is an efficient format for storing Hive data. It improves reading, writing, and processing data by dividing rows into groups called stripes and by storing data within stripes in column order. This effectively enables column pruning, which mimics the columnar properties of data stored in HP Vertica and other MPP databases.

We can easily create an ORC table from an existing text file in Hive. First we’ll create an External Table referencing the original file. In this case the file is pipe delimited and contains 5 columns describing Order Items.

<b>DROP TABLE IF EXISTS</b> tmp_order_items;

CREATE EXTERNAL TABLE tmp_order_items (
  id INT,
  order_id INT,
  sale_price DOUBLE,
  inventory_item_id INT,
  returned_at STRING
)

  ROW FORMAT 
  DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE;

LOAD DATA INPATH "/thelook/order_items_out.dat.0"
 OVERWRITE INTO TABLE tmp_order_items;

Once we’ve created an External Table on the original text file, we can use that existing table to convert the data into ORC format. This can be done via a simple CREATE TABLE AS statement:

DROP TABLE IF EXISTS order_items;

CREATE TABLE order_items
STORED AS ORC
AS SELECT  *  
FROM tmp_order_items;
DROP TABLE tmp_order_items;

This process creates a new External Table referencing the data in the original table, but stored in a new file in ORC format. Now that we have the data in the desired ORC format, we can utilize Vertica’s ORC reader and query the data from HP Vertica directly.

Creating the Vertica External Tables

Now that we’ve created the ORC files, we can create External Tables in Vertica so that we can query them. This is simlar to what we did in Hive, but we’ll reference the Vertica data types in the create statement and the location of the ORC files, whether that’s on HDFS or locally stored.

CREATE SCHEMA thelook;
CREATE EXTERNAL TABLE thelook.order_items (
id INT,
order_id INT,
sale_price float,
inventory_item_id INT,
returned_at varchar(80)
)
AS COPY FROM '/home/lookerops/thelook_orc/order_items/*' ORC;

Now we can query the order_items table in Vertica as if it were any table native to the database. To test, we’ll run a simple query:

SELECT * 
FROM thelook.order_items
limit 10

The Look Order Items

Exploring Data in Looker

Using Looker, we can query the ORC based External Table in HP Vertica directly to provide both exploration and visualization capabilities. Because tables created in Vertica using the ORC reader can be queried in the same way as any native Vertica table, Looker is also able to work with them in the same way.

First, we’ll create a view file for order_items. This is accomplished using LookML, which is the modeling language of Looker that serves as an abstraction of SQL. In the view file we’ll describe the dimensions and measures we’d like to expose to our end users. Dimensions will reference columns in the underlying database as well as custom derived dimensions that we can group by to produce measures, which are aggregates like counts, sums, and averages. Because Looker queries Vertica directly, these dimensions and measures effectively function as snippets of SQL to build a desired desired dataset and/or visualization directly from HP Vertica.

For example, below is a sample of the view file for order_items. You can see some of the definitions simply reference underlying table columns, and some require transformation.

- view: order_items
  sql_table_name: thelook.order_items
  fields:

  - dimension: id
    primary_key: true
    type: int
    sql: ${TABLE}.id

  - dimension: inventory_item_id
    type: int
    sql: ${TABLE}.inventory_item_id

  - dimension: order_id
    type: int
    sql: ${TABLE}.order_id

  - dimension_group: returned
    type: time
    timeframes: [time, date, week, month, year]
    sql: ${TABLE}.returned_at::timestamp

  - dimension: sale_price
    type: number
    sql: ${TABLE}.sale_price

  - measure: revenue
    type: sum
    sql: ${sale_price}

  - measure: count
    type: count
    drill_fields: [id, orders.id, inventory_items.id]

We can then expose these definitions via explores, which provide a base table and join relationships between other tables we’d additionally like to expose to the end user. Assuming we’ve defined dimensions and measures for other tables in the database, such as orders and users, we can establish the explore definition below:

- explore: order_items
  joins:
    - join: orders
      foreign_key: order_id

    - join: inventory_items
      foreign_key: inventory_item_id

    - join: users
      foreign_key: orders.user_id

    - join: products
      foreign_key: inventory_items.product_id

We can then create Looks (saved query results) in Looker by selecting the desired dimensions and measures from the defined Explore. This enables the end business user to create reports directly from Vertica and the underlying ORC tables without having to write Vertica SQL or Hive to produce reports, visualizations, and dashboards.

Business Pulse Dashboard

Next Previous

Subscribe for the Latest Posts