Looker Blog : Data Matters

Data Science with Looker: Part II

Haarthi Sadasivam, Data Evangelist

Apr 13, 2017

Advanced Analytics Part II

Data Science with Looker and Python: Part II

To show how seamlessly Looker can integrate into a data science workflow, we took a public dataset (Seattle bikeshare data) and applied a predictive model using Looker, Python, and Jupyter Notebooks.

Follow along as I walk through the setup.

We loaded nearly 2 years (October 2014 - August 2016) of historical daily trip data for Seattle’s bikeshare program into BigQuery. Because we know that the “rebalancing problem” is a key cost driver for bike sharing programs, we thought we’d explore the impact that weather has on trip volume. To do that we’ve imported daily weather data (e.g. temperature, humidity, rain) for Seattle alongside the trip data in BigQuery. Based on our model, we’d like to predict future trip counts by station, but more importantly operationalize those insights to automatically facilitate rebalancing bikes to underserved stations.

How might the weather affect people’s willingness to ride?

First, we want to define relationships between different data points within the bike share data (trip information, station facts, and weather forecasts) in Looker in order to easily start exploring this data. LookML lets us define these relationships once and then explore freely.

We can then look for a relationship between daily trip count and temperature by selecting three fields -- Trip Start Date, Temperature, and Trip Count -- in the Looker explore panel. Looker automatically writes SQL, retrieves data, and allows us to visualize the results to see if there could be a correlation between temperature and trip count.

Advanced Analytics Diagram 1

A quick examination of the scatterplot indicates a relationship between temperature and ride volume, so let’s build a regression model that predicts trip count as a function of temperature.

Any dataset that we build in Looker can not only be explored in Looker’s web interface, but also in data science environments as well. With a single API call (by passing in a Query ID, Look ID, or the Query Slug), we can access the generated SQL or pull in the result set itself. We then can run a Python script that creates a simple regression model predicting Trip Count as a function of the daily Temperature.

The actual regression model is straightforward and can be written in python by leveraging a Stats package --

Y = Average Trip Count
X = Temperature

model = smf.ols(formula='Y ~ X', data=data).fit()
print(model.summary())

Data Analysis pre Looker

Based on our model we can determine that --

  1. A 1 degree increase in temperature leads to nearly 12 (11.977) additional trips
  2. Temperature has a significant effect on trip count (P > | t| is indistinguishable from 0)
  3. The R² of our basic model is .62 (meaning the temperature can explain 62% of the variability in daily trip count)

Our model predicting trip count based on just temperature is certainly decent, but can we do better? Let’s try to improve on our model by adding other variables. Intuitively, it seems that the trip count will be affected not just by temperature, but also by the humidity level (or the likelihood that it rained on a particular day or the likelihood that there was snow).

To include another factor like humidity in our model, we simply jump back to Looker to further explore the data. We can add humidity as an additional field and pass in the new dataset to our model by grabbing the updated ID.

Explore the Data

Data Analysis

The model now isolates the effect of both temperature and humidity.

  1. For every 1 degree increase in temperature, trip count increases by 7.74 rides.
  2. For every 1 percent increase in humidity, trip count decreases by 4.62 rides
  3. Both variables have a significant impact on trip count.
  4. Our R² has increased to .686

We can continue to go back and bring in additional variables from Looker into our model by simply adding additional fields to the explore page, and making an API call. But for now let's use temperature and humidity as our two factors or dependent variables that could affect trip count.

For more complex models, we may choose to calculate our predictions in our DSE before piping them back into the database, but for this simple example, we can simply push our model coefficients into our database and use Looker to operationalize these calculations and metrics.

explore: trip {
  join: trip_count_prediction {
    type: cross
    relationship: many_to_one
  }

By joining in trip count predictions (which contains coefficients and intercepts for our linear regression model) in Looker, we can create fields to predict the Average Trip Count based on weather factors. The Trip count calculation, based on the multiple regression model, should look something like the following.

y=β0+β1×Temperature+β2×Humidity+β3×OtherFactorsy=β0+β1×Temperature+β2×Humidity+β3×OtherFactors

  measure: trip_count_prediction{
    type: average
    sql:  (${trip_count_prediction.x0} * ${weather.temperature}) + 
          (${trip_count_prediction.x1} * ${weather.humidity}) + 
          ${trip_count_prediction.intercept};;
    value_format_name: decimal_1
    view_label: "Trip Time Prediction"
  }

We can now can access our predictions right in Looker and can explore real data and predictions right alongside each other. It means that we can also visualize and operationalize this dataset to make sure that the model is continuing to perform well.

Let’s start by comparing our trip counts to those predicted by our model -

Compairing Trips in Looker

Trip Regression

Our predictions seem to be in line with the actual trip counts for those days since a graph of the actual trip count vs the predictive trip count shows a linear relationship. Let’s take this a step further and use our model to forecast average trip times/lengths for future dates based on the same factors we've outlined above (the temperature and the humidity).

We can easily pull in weather information for the upcoming week in Seattle (using third-party weather APIs) and use those points to estimate the number of bike share trips. For the following week, as the temperature decreases to the 40s towards the end of the week, our model predicts that around ~250 trips will be taken.

Trips per Date

We can also take into account a degree of error on our weather forecasts for the week. Let’s create additional metrics within Looker that will allow us to define trip counts based on a variance in the temperature.

  filter: weather_variance {
    type: string
  }

  dimension: adjusted_weather {
    type: number
    sql: ${weather.temperature} + CAST({% parameter $weather_variance %} AS FLOAT64) ;;

  }

  measure: trip_count_prediction_what_if {
    type: average
    sql:  (${trip_count_prediction.x0} * ${adjusted_weather}) +
         (${trip_count_prediction.x1} * ${weather.humidity}) +
         ${trip_count_prediction.intercept};;
    value_format_name: decimal_1
    view_label: "Trip Count Prediction"
  }

Trip count upper and lower bound

Taking the results of our model, we could forecast revenue for the upcoming week based on the number of trip counts, or predict average trip length, and a number of other key business metrics to help drive business decisions.

How might Weather factors affect the bike overflow rate at a given station?

Taking our trip count predictive model one step further, we can predict the overutilization rate (number of bikes taken from a station - number of bikes docked at a station) at certain stations on any given day based on weather conditions like Temperature.

We can pull in data from a particular start station (by temperature) and an end station (by temperature) by exploring data with those fields in Looker.

Then, we build a Linear regression model that predicts the number of trips from a start station and an end station, this time grouped by the Station ID. Our new regression model will have coefficients by Station

Linear Regression Output

Now that we have our predictive model, let's pull it up in Looker and calculate how many Trips were taken per station.

Join in station prediction table in Looker and create fields to predict trips taken based on Temperature.

explore: trip {
  join: station_prediction  {
    type:  left_outer
    relationship: one_to_one
    sql_on:  ${trip.from_station_id} = ${station_prediction.bike_station} ;;
  }

Our predictive measure in Looker, follows the linear regression model that we used and pulls in the coefficients and intercept that have been persisted back into the DB.

Linear Regression Model:

In our case:

Linear Regression Equation Weather

  measure:  start_predictions {
    type:  average
    sql: (${start_slope} * ${weather.temperature} ) +  ${start_intercept};;
    value_format_name: decimal_1
  }

  measure:  end_predictions {
    type:  average
    sql: (${end_slope} * ${weather.temperature} ) +  ${end_intercept};;
    value_format_name: decimal_1
  }

  measure:  predicted_station_overflow {
    type: number
    sql:  ${end_predictions} - ${start_predictions} ;;
    value_format_name: decimal_1
  }

We can now use Looker to visualize and operationalize this data for future dates. The dimensions and measures we have listed above pull in the weather forecasts for the upcoming week and look like the following:

Rides Data Certain Stations have a high overflow rate (more bikes are returned to that station than are taken out from that station). For example, the 2nd Avenue and Pine St Station is in high demand and will tend to have more bikes docked at that station on this particular day.

Next, we can visualize this information to see if we can spot trends. Because we have our data in Looker, we can effortlessly pull in additional fields about the station like the latitude and longitude of the station and plot the results on a map.

Rides Map

It looks like there is heavy overutilization in Downtown Seattle and by the University district and less utilization around residential areas like Capitol Hill. More people could simply be biking into work and taking a different mode of transportation back home.

What does this mean for the Seattle bikeshare program? Looker could easily send out detailed alerts and scheduled emails to employees every morning with details on predicted utilization patterns. These predictions of bike overflow rates for stations across Seattle, could help guide bike rebalancing.

Pretty cool right?

Looker has made it very easy to operationalize this entire workflow by providing easy access to clean data, providing a framework to easily visualize predictive results, and by making the resulting analyses easy to share across the organization. To learn more about the benefits of using Looker in conjunction with your DSE read our blog post and reach out to a looker analyst for a demo.

Next Previous

Subscribe for the Latest Posts