Looker Blog: Data Matters

Predict Lending Club's loan portfolio returns using HP Vertica, Looker and Distributed R

Segah Meer, Analyst

May 28, 2015

Looker and HP Vertica have been strategic partners for well over a year now. We like Vertica because it's a powerful, enterprise-grade analytics database built to run massive analytical workloads. Vertica likes Looker because we've architected our analytics platform to leverage the power and performance of the underlying database: queries execute directly in the database, and users can build Vertica's advanced analytics functions directly into LookML, Looker's semantic data modeling layer.

The latest offering from Vertica, Distributed R, is very exciting. Vertica users can now train and execute R models directly on their cluster, across very large sets of data. With Looker's integration, data scientists can operationalize their findings with the rest of the organization. While Looker isn't primarily a data scientist's tool, it's a very powerful way for a data scientist to bridge the gap between the advanced work they do and how it can be applied and used by the business.

In the following case study, we illustrate this workflow using loan application data from Lending Club's public API. The study is split into two parts.

In part I, we focus on how a data scientist would interact with Looker and the Vertica database during the exploration phase. In part II, we illustrate how to build statistical inferences and use the power of a multi-node cluster to process large datasets, all while working within the familiar environment of R.

Exploring Lending Club Data with Looker

Every day, hundreds of loan applications are listed on Lending Club, and it is possible to view them in near real time, either directly on Lending Club's website or through an API. For our purposes, we wanted to be able to slice and dice this finance data however we like.
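Pulling the live listing into R looks roughly like the sketch below. The endpoint URL, the Authorization header, the LC_API_KEY environment variable, and the field names are assumptions based on Lending Club's public investor API, so adjust them to your own setup:

library(httr)
library(jsonlite)

# Investor API key, kept outside the script (LC_API_KEY is a name chosen for this sketch)
lc_api_key <- Sys.getenv("LC_API_KEY")

# Request the current loan listing (showAll returns every listed loan, not just the newest batch)
resp <- GET(
  "https://api.lendingclub.com/api/investor/v1/loans/listing",
  add_headers(Authorization = lc_api_key, Accept = "application/json"),
  query = list(showAll = "true")
)
stop_for_status(resp)

# Flatten the JSON payload into a data frame of current applications
listing <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$loans
head(listing[, c("id", "loanAmount", "intRate", "grade", "purpose")])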

After setting up the API, the first step was to create a simple Looker Dashboard that would provide insight into incoming applications:

Building Dashboards in Looker

While it is already possible to drill into this particular visualization, we might want to structure our dashboard on what typically matters to any investing process: regions, grade tiers, credit utilization, and time:

Lending Club Looker Dashboard

From this we can, for example, learn how the use of borrowed money changes over time. By adjusting the time filters over the past 3 months, we can see that the share of creditworthy applicants applying for a loan for debt consolidation has dropped. In other words, if we were to confirm this over a longer period, we could hypothesize that the average applicant today differs from the one many months ago.

Historical Analysis

While being able to explore the incoming volume of applications is useful, we often want to compare applications against a historical dataset of loans with known performance stats: loans that were issued a while ago and have since been either repaid or written off the books. A quick look at our historical dashboard shows several hundred thousand loans issued in the past.

Lending Club Looker Dashboard 2

Typically, when analyzing historical performance, we are concerned with calculating expected profitability. For a single note, we have the following important calculations:

E(reward) = ( 1 − P(default )) × loan_amount × note_interest

E(loss) = P(default) × loan_amount

E(profitability) = E(reward) − E(loss) = loan_amount × ( ( 1 − P(default) ) × note_interest − P(default) )
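To make the arithmetic concrete, here is a quick worked example in R with made-up numbers for a single note (the figures are purely illustrative):

# A single hypothetical note: $10,000 at a 12% interest rate with an
# estimated 5% probability of default
loan_amount   <- 10000
note_interest <- 0.12
p_default     <- 0.05

e_reward <- (1 - p_default) * loan_amount * note_interest   # 1140
e_loss   <- p_default * loan_amount                         # 500
e_profit <- e_reward - e_loss                               # 640

c(expected_reward = e_reward, expected_loss = e_loss, expected_profit = e_profit)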

Since the default outcomes are known for historical loans, we can perform regression analysis in R to estimate P(default), and then forecast loan performance for the latest pool of applications, for which these outcomes are not yet known.

We could start by loading all this data into a powerful server with lots of RAM. Alternatively, Vertica's Distributed R cluster lets us do the same while keeping the data either in shared memory or, where it belongs, in the database.

Discovering Predictive Insights

Data Scientists spend significant time preparing data to build robust and accurate models. The typical workflow requires combining data from different data sources, deriving new attributes, transforming existing attributes, as well as performing normalization and outlier treatment.

This data wrangling can be done much more efficiently by using Vertica SQL and by leveraging Distributed R's native parallel connector. R supports a native connector that accesses data in Vertica locally; it not only simplifies data access, but also speeds up loading roughly 5-6x compared to parallel ODBC connectors.

This means that we can now do regression analysis differently. We can:

  1. perform all data transformations using efficient SQL-based manipulations within the database, and
  2. load the transformed data into a multi-node R cluster.

First, we create LookML dimensions that we will use within Looker to transform the underlying data. This is especially important for this financial dataset, where there is a lot of manually entered data and it can be hard to keep track of all the transformations.

- dimension: loan_status
  sql: ${TABLE}.loan_status

- dimension: is_bad
  sql: |
      CASE 
      WHEN ${loan_status} IN ('Late (16-30 days)', 'Late (31-120 days)', 'Default', 'Charged Off')
        THEN 1
      WHEN ${loan_status} = ''
        THEN NULL
      ELSE 0
      END

- dimension: is_rent
  sql: |
      CASE 
      WHEN ${borrower.home_ownership} = 'RENT' THEN 1
      ELSE 0
      END

- dimension: borrower.revol_utilization
  type: number
  sql: trim(trailing '%' from trim(${TABLE}.revol_util))::float

We then select the necessary dimensions from Looker's Explore window to generate SQL that produces the transformed data:

Lending Club Looker Dashboard 2

This SQL will be used to generate a training set and a model testing set. Because we need to confirm our results on an out-of-sample set, we reserve roughly 20% of the data for a 'testing' table.

CREATE VIEW loan_history_for_r AS
(
...
);

CREATE TABLE testing AS (
SELECT *
FROM loan_history_for_r WHERE "loans.is_bad"=0
AND RANDOM() < 0.2);


INSERT INTO testing (SELECT *
FROM loan_history_for_r WHERE "loans.is_bad"=1
AND RANDOM() < 0.2);

CREATE TABLE training AS (SELECT *
FROM loan_history_for_r EXCEPT SELECT * FROM testing);

Second, we load the transformed data into shared storage. Due to the potential size of the data stored in our database, we use a Distributed R cluster made up of 3 nodes. Distributed R provides easy-to-use packages and a fast native connector for loading data from Vertica into R, so the dataset is no longer limited by the memory capacity of a single node.

<div class="code" ace-mode="ace/mode/sql" ace-theme="ace/theme/lookml" ace-gutter="" style="margin-bottom:25px;"> 
library(HPdata)
library(distributedR)

# Start the Distributed R cluster and load the training table from Vertica
# into distributed arrays: Y is the response, X holds the predictors
distributedR_start()
historical_data <- db2darrays(tableName='training', dsn='Vertica',
      # Y - matrix
      list('"loans.is_bad"'),
      # X - matrix
      list('"loans.issue_year"','"loans.is_rent"',
      '"borrower.revol_utilization"','"borrower.fico_range_high"',
      '"borrower.fico_range_low"','"borrower.inquiries_last_6mths"',
      '"borrower.pub_rec_bankruptcies"'))

Forecasting for Visual Inferences

For simplicity, we focus only on the 7 most important variables as predictors of whether a loan will default, so the final regression looks something like this:

library(HPdregression)

# Fit a distributed logistic regression: default flag (Y) on the predictor arrays (X)
theModel <- hpdglm(responses=historical_data$Y, predictors=historical_data$X, family=binomial)

which is essentially the distributed version of the single-node R equivalent:

glmModel <- glm(factor(is_bad) ~ issue_year + last_fico_range_high + last_fico_range_low +
  pub_rec_bankruptcies + revol_util + inq_last_6mths + is_rent, data = train, family = "binomial")
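Before pushing anything to the database, we would check the fit against the held-out 20%. A minimal sketch of that check, using the single-node glmModel fit above and assuming the testing table has been pulled into a local data frame named testing whose column names match the formula, might look like this:

# Predicted probability of default on loans the model never saw
p_hat <- predict(glmModel, newdata = testing, type = "response")

# Confusion matrix at a 0.5 cutoff, plus the simple hit rate on unseen loans
predicted_bad <- as.integer(p_hat > 0.5)
table(actual = testing$is_bad, predicted = predicted_bad)
mean(predicted_bad == testing$is_bad, na.rm = TRUE)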

Finally, after performing all the standard significance tests and successfully validating the predicted estimates against the remaining 20% of historical data (i.e. our out-of-sample subset), we can safely push the model estimates into our database:

deploy.model(model=theModel, dsn='Vertica', modelName='glmLoans', modelComments='A logistic regression model for LendingClub Data')
distributedR_shutdown()

The command above lets us invoke the glmLoans model and apply its coefficients to any other dataset within the database. This means we can now take Lending Club's loan listing data, which has no historical default information yet, and apply these coefficients to it straight from the database.

Since we are typically interested in using an application's probability of default as a simple attribute of the loan, the best approach in LookML is to persist the association between every loan application and its expected probability of default back into the database. The sql_trigger_value then lets us set the frequency at which we either pick up changes in coefficients from an updated R model or apply the existing model to a newly updated set of loan applications.

- explore: listings
  joins:
    - join: predict_bad_loans
      foreign_key: listings.id

- view: predict_bad_loans
  derived_table:
    sql_trigger_value: SELECT DATE(GETDATE())
    sql: |
      SELECT 
        listings."id" AS listing_id
        , GlmPredict(
            "loans.issue_year","loans.is_rent",
            "borrower.revol_utilization","borrower.fico_range_high",
            "borrower.fico_range_low","borrower.inquiries_last_6mths",
            "borrower.pub_rec_bankruptcies"
            USING PARAMETERS model='dbadmin/glmLoans', TYPE='response') AS is_bad_probability
      FROM listings

  fields:
    - dimension: listing_id
      primary_key: true

    - dimension: is_bad_probability
      type: number
      decimals: 2

Without doing much further work, we can already apply our existing dashboard filters and discover the expected profit dynamics over time for a pool of applications rated “Grade C” in the past 14 days, from applicants earning more than $1000.
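To connect the in-database predictions back to the profitability formulas from earlier, here is a hypothetical R sketch; scored_listings and its columns are illustrative stand-ins for the output of querying listings joined to predict_bad_loans, not actual Lending Club field names:

# `scored_listings` stands in for the result of querying the listings table
# joined to the predict_bad_loans view; the column names here are illustrative
scored_listings <- data.frame(
  listing_id         = c(101, 102, 103),
  loan_amount        = c(12000, 8000, 20000),
  interest_rate      = c(0.13, 0.09, 0.17),
  is_bad_probability = c(0.04, 0.02, 0.11)
)

# Expected profit per application, using the formula from the historical analysis
scored_listings$expected_profit <- with(
  scored_listings,
  loan_amount * ((1 - is_bad_probability) * interest_rate - is_bad_probability)
)

# Rank incoming applications by expected profit
scored_listings[order(-scored_listings$expected_profit), ]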

Concluding Remarks

As demonstrated above, the Looker integration with Vertica's Distributed R enables a powerful workflow. Once a Data Scientist has built and trained a model using Distributed R and their favorite R console, they can implement it in the LookML modeling layer. Doing so not only allows the Data Scientist to apply the model to the broader analytics environment for further data discovery and analysis, but also enables business users to leverage it to get the insights they need, regardless of the technology and skill used to create it.

Happy R modeling and profitable investing!


Full source code can be found here on GitHub.
