Looker and Vertica : ORC reader
Oct 30, 2015
Vertica’s HDFS connector provides a seamless experience when accessing data both in HP Vertica and data housed in HDFS by allowing data access to both via Vertica’s querying interface. This feature set means users can take advantage of both Vertica’s performance and analytical functions across both native Vertica and Hadoop environments from one central location. And since Looker queries Vertica directly, data on HDFS can be explored and visualized in the same way.
Starting with Vertica 7.1 SP2, HP Vertica has improved its HDFS connector by providing enhanced ORC file processing capability, including column pruning and predicate pushdown. This addition allows querying data in files on HDFS or locally to be similar to HP Vertica’s own native columnar format, providing significant performance gains over regular text files. In this tutorial, we’ll show how to create ORC files from text files, use the Vertica ORC reader to create queryable tables from those files in HP Vertica, and finally explore and visualize the data using Looker.
ORC file creation
Optimized Row Columnar (ORC) format is an efficient format for storing Hive data. It improves reading, writing, and processing data by dividing rows into groups called stripes and by storing data within stripes in column order. This effectively enables column pruning, which mimics the columnar properties of data stored in HP Vertica and other MPP databases.
We can easily create an ORC table from an existing text file in Hive. First we’ll create an External Table referencing the original file. In this case the file is pipe delimited and contains 5 columns describing Order Items.
<b>DROP TABLE IF EXISTS</b> tmp_order_items;
CREATE EXTERNAL TABLE tmp_order_items (
DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
LOAD DATA INPATH "/thelook/order_items_out.dat.0"
OVERWRITE INTO TABLE tmp_order_items;
Once we’ve created an External Table on the original text file, we can use that existing table to convert the data into ORC format. This can be done via a simple CREATE TABLE AS statement:
DROP TABLE IF EXISTS order_items;
CREATE TABLE order_items
STORED AS ORC
AS SELECT *
DROP TABLE tmp_order_items;
This process creates a new External Table referencing the data in the original table, but stored in a new file in ORC format. Now that we have the data in the desired ORC format, we can utilize Vertica’s ORC reader and query the data from HP Vertica directly.
Creating the Vertica External Tables
Now that we’ve created the ORC files, we can create External Tables in Vertica so that we can query them. This is simlar to what we did in Hive, but we’ll reference the Vertica data types in the create statement and the location of the ORC files, whether that’s on HDFS or locally stored.
CREATE SCHEMA thelook;
CREATE EXTERNAL TABLE thelook.order_items (
AS COPY FROM '/home/lookerops/thelook_orc/order_items/*' ORC;
Now we can query the order_items table in Vertica as if it were any table native to the database. To test, we’ll run a simple query:
Exploring data in Looker
Using Looker, we can query the ORC based External Table in HP Vertica directly to provide both exploration and visualization capabilities. Because tables created in Vertica using the ORC reader can be queried in the same way as any native Vertica table, Looker is also able to work with them in the same way.
First, we’ll create a view file for order_items. This is accomplished using LookML, which is the modeling language of Looker that serves as an abstraction of SQL. In the view file we’ll describe the dimensions and measures we’d like to expose to our end users. Dimensions will reference columns in the underlying database as well as custom derived dimensions that we can group by to produce measures, which are aggregates like counts, sums, and averages. Because Looker queries Vertica directly, these dimensions and measures effectively function as snippets of SQL to build a desired desired dataset and/or visualization directly from HP Vertica.
For example, below is a sample of the view file for order_items. You can see some of the definitions simply reference underlying table columns, and some require transformation.
- view: order_items
- dimension: id
- dimension: inventory_item_id
- dimension: order_id
- dimension_group: returned
timeframes: [time, date, week, month, year]
- dimension: sale_price
- measure: revenue
- measure: count
drill_fields: [id, orders.id, inventory_items.id]
We can then expose these definitions via explores, which provide a base table and join relationships between other tables we’d additionally like to expose to the end user. Assuming we’ve defined dimensions and measures for other tables in the database, such as orders and users, we can establish the explore definition below:
- explore: order_items
- join: orders
- join: inventory_items
- join: users
- join: products
We can then create Looks (saved query results) in Looker by selecting the desired dimensions and measures from the defined Explore. This enables the end business user to create reports directly from Vertica and the underlying ORC tables without having to write Vertica SQL or Hive to produce reports, visualizations, and dashboards.