Accelerating machine learning with Looker + Amazon SageMaker
Nov 26, 2018
Amazon and Looker have been strategic partners since shortly after Looker’s inception. Looker hosts its instances in Amazon Web Services (AWS), and over 55% of our clients use one of the many Amazon-hosted cloud databases, such as Redshift, Athena, and various Relational Database Service (RDS) flavors, as their primary Looker data source. With such compatible products and hundreds of joint customers, Looker and AWS are continuously working together to make the end-user experience more streamlined, which makes re:Invent one of the annual highlights for our team and customers. This year is no exception.
At AWS re:Invent 2018, we’re announcing an integration with Amazon SageMaker, as well as a new trial of Amazon Redshift and Looker. We’re excited about both of these additions because we believe that the combination of Looker and Amazon is truly changing the lives of our joint customers by allowing them to build data-driven cultures and thriving companies.
New Action Hub Integration with SageMaker
Looker has already developed Action Hub integrations that allow Looker to spin Amazon Elastic Compute Cloud (EC2) instances up/down based on a timed or data-triggered schedule. Now, we have a new Action Hub integration with Amazon SageMaker that streamlines the data science workflow by allowing model training and inference to be initiated directly from within the Looker Scheduler.
What does that mean?
That means that, from within Looker, data scientists can:
- create a query
- visualize it
- filter it
- remove outliers and reshape the data
- select the Action Hub integration to SageMaker
- choose an algorithm (such as XGBoost) for model training
- choose a location in S3 where the model will be saved
- and then SageMaker will handle the rest!
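For readers curious what the last few steps might translate to behind the scenes, here is a minimal sketch of a request shaped for SageMaker's CreateTrainingJob API. It is an illustration only, not the Action Hub's actual implementation: the function name, bucket paths, role ARN, image URI, instance type, and hyperparameter values are all placeholder assumptions.

```python
def build_training_request(job_name, train_s3_uri, output_s3_uri, role_arn,
                           image_uri, instance_type="ml.m5.xlarge",
                           instance_count=1):
    """Assemble keyword arguments shaped for SageMaker's CreateTrainingJob API.

    Every value passed in is a caller-supplied placeholder; nothing here
    contacts AWS.
    """
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            # URI of the algorithm container (for example, an XGBoost image).
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # SageMaker expects hyperparameter values as strings.
        # These two are illustrative XGBoost settings, not recommendations.
        "HyperParameters": {
            "objective": "binary:logistic",
            "num_round": "100",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # The S3 location where the trained model artifact will be saved.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# Hypothetical values for illustration only.
request = build_training_request(
    job_name="term-loan-xgboost-demo",
    train_s3_uri="s3://example-bucket/looker-exports/train/",
    output_s3_uri="s3://example-bucket/models/",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    image_uri="<xgboost-image-uri-for-your-region>",
)
# With AWS credentials configured, the request could be submitted with:
#   boto3.client("sagemaker").create_training_job(**request)
```

The point of the integration is that you never have to write this by hand: the query results, algorithm choice, and S3 location you pick in Looker stand in for these fields.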
Since training a model is only the first part of the machine learning (ML) process, we’re also launching a second Action Hub integration that closes the loop on predictions. With this integration you’ll be able to:
- point to a saved / trained model in S3
- send over a set of predictive features from a Looker query
- get a result set dropped back into S3 that can be queried (via Redshift Spectrum or Athena)
- see model metrics such as Precision, Accuracy, MAE, and AUC (among others) on a Looker dashboard
- explore and visualize everything in Looker
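If the metric names above are unfamiliar, the following sketch shows how precision, accuracy, MAE, and AUC can be computed from a set of predictions. It uses plain Python and a tiny made-up sample; it is not the calculation Looker or SageMaker performs internally, and the function name and data are assumptions for illustration.

```python
def classification_metrics(actual, predicted_prob, threshold=0.5):
    """Compute a few common model metrics for a binary classifier.

    actual: list of true labels (0 or 1)
    predicted_prob: list of predicted probabilities of the positive class
    """
    predicted = [1 if p >= threshold else 0 for p in predicted_prob]
    pairs = list(zip(actual, predicted))

    true_pos = sum(1 for a, p in pairs if a == 1 and p == 1)
    false_pos = sum(1 for a, p in pairs if a == 0 and p == 1)
    correct = sum(1 for a, p in pairs if a == p)

    # Precision: of everything predicted positive, how much really was.
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else 0.0

    # Accuracy: share of all predictions that matched the true label.
    accuracy = correct / len(actual)

    # MAE: average absolute gap between the label and the probability.
    mae = sum(abs(a - p) for a, p in zip(actual, predicted_prob)) / len(actual)

    # AUC: chance that a random positive outscores a random negative
    # (ties count as half). Computed pairwise, which is fine for small data.
    pos = [p for a, p in zip(actual, predicted_prob) if a == 1]
    neg = [p for a, p in zip(actual, predicted_prob) if a == 0]
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in pos for y in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else 0.0

    return {"precision": precision, "accuracy": accuracy,
            "mae": mae, "auc": auc}


# Made-up sample: did the customer respond, and what did the model predict?
responded = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6]
metrics = classification_metrics(responded, scores)
```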
SageMaker supports a number of different machine learning algorithms via its API. Looker will initially provide integration with two of them, XGBoost and Linear Learner, with others expected to be released on a rolling basis going forward.
How can Looker’s integration with SageMaker benefit you?
Let’s look at a common example. Suppose you’re a marketer who finds it very difficult to predict how marketing campaigns will be received.
In this scenario, you can use ML with Looker and SageMaker to create models that attempt to predict which audience members are likely to respond to marketing campaigns based on previous data from similar campaigns. This supervised form of learning is quite effective when the correct features are used.
How does this differ from the traditional data science workflow?
Let’s say you’re a bank looking to offer a term loan to existing customers. You have a set of data from previous campaigns, including things like customer age, income, prior defaults, and number of campaign touches. You’ve blanketed these customers with a term loan offer in the past and are interested to know which types of customers responded positively to the offer. Intuitively, you know that there must be clusters or cohorts of customers with a high likelihood of a positive response (i.e. people with this set of characteristics took a new term loan when offered).
In a traditional data science workflow, you would take all the data, pull it into Python or R, and use that environment to explore the data. You would need to split out a training dataset and a validation dataset, and hold out some additional data to test the model. Only then could you begin training, defining each input feature (predictor) and providing a (sometimes bewildering) array of hyperparameters specific to the training algorithm. For that, you would need to be pretty conversant with the programming language, the data itself, and the inner workings of the machine learning algorithm. And it all might need to be repeated whenever the input data changed.
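To make the manual splitting step concrete, here is a minimal sketch of a train / validation / test split in plain Python. The 70/15/15 proportions, the fixed seed, and the function name are assumptions chosen for illustration; real projects often use a library helper and different ratios.

```python
import random


def split_dataset(rows, train_frac=0.7, validation_frac=0.15, seed=42):
    """Shuffle rows and split them into train, validation, and test sets.

    Whatever is left after the train and validation slices becomes the
    held-out test set.
    """
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Stand-in for 100 customer records from previous campaigns.
customers = list(range(100))
train, validation, test = split_dataset(customers)
```

Every record lands in exactly one of the three sets, so the model is never evaluated on data it was trained on.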
With Looker, similar exploration can be done by a business user or data analyst (or a savvy data scientist) using a Looker Explore. The resulting query will be reusable (so it can be reapplied whenever new data arrives) and the results can be automatically sent down to SageMaker, creating a new model or augmenting an existing model with newly arrived data. Furthermore, with SageMaker, you don’t need to have powerful hardware or to manually spin up EC2 instances to handle the training workload. When training on a large dataset, you can specify a larger instance size, or even run multiple instances and have SageMaker handle all the distribution for you. If you didn’t just swoon, ask the data scientist next to you how cool that is.
After you have a well-trained model, predictions can be performed in real time or with a batch transform job. Whenever new data arrives, you can refine the model with further training. With the new predictions, you’re helping to reduce overall marketing costs and ensuring that targeted campaigns reach the desired customers.
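As a rough illustration of the batch option, the sketch below builds a request shaped for SageMaker's CreateTransformJob API. It is not the integration's actual code: the function name, job name, model name, S3 paths, and instance type are placeholder assumptions, and it presumes a SageMaker model has already been registered from the trained artifact.

```python
def build_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type="ml.m5.xlarge", instance_count=1):
    """Assemble keyword arguments shaped for SageMaker's CreateTransformJob API.

    Nothing here contacts AWS; all values are caller-supplied placeholders.
    """
    return {
        "TransformJobName": job_name,
        # Name of a model already registered in SageMaker.
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3_uri,
            }},
            "ContentType": "text/csv",
            # Treat each line of the CSV as one record to score.
            "SplitType": "Line",
        },
        # Predictions are written here, where Athena or Redshift Spectrum
        # could later query them.
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


# Hypothetical values for illustration only.
transform_request = build_transform_request(
    job_name="term-loan-predictions-demo",
    model_name="term-loan-xgboost-demo",
    input_s3_uri="s3://example-bucket/looker-exports/features/",
    output_s3_uri="s3://example-bucket/predictions/",
)
# With AWS credentials configured, the request could be submitted with:
#   boto3.client("sagemaker").create_transform_job(**transform_request)
```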
Not using Looker or Redshift yet? We’ve got a new joint trial for you!
Redshift currently offers a trial period to provide a first-hand experience before you commit. The newly announced Looker Redshift Trial Experience will take this a step further, allowing users to seamlessly test out an entire data stack, from data warehouse to analytics to dashboards and actions.
To help you get up and running even faster, we have a suite of Looker Blocks, pre-built templates of code customized to model data for specific use cases and tools, optimized for AWS users. We co-authored some of these Looker Blocks® with AWS to help customers get the most out of their Redshift usage by making it as simple as possible to monitor AWS log data, identify opportunities to improve performance, and isolate levers to help optimize AWS spending. Looker Blocks® drive faster time-to-value and have helped joint Redshift customer adoption grow 200 percent over the last two years.
Curious how it works?
You can start the joint free trial here and then…
- load or stream all data into the Amazon S3 data lake
- Amazon Redshift Spectrum can then query from high performance disks or directly from Amazon S3 in open data formats
- Redshift automatically connects with Looker allowing customers to:
  - Store, manage and process petabytes of data in seconds
  - Access a vast library of advanced analytical functions
  - Implement Looker Blocks® to get further faster
  - Control distribution patterns, storage architecture and auto-scale at the push of a button
If you want to conduct geospatial analysis or other transactional workflows, you can also load data directly from S3 via Athena.
Still curious? You can learn more about the combination of AWS and Looker here.