Accessible data science with BigQuery Machine Learning + Looker
Jul 25, 2018
At Google Cloud Next ‘18 today, Google took a step toward more accessible machine learning with the announcement of a new feature for Google BigQuery called BigQuery Machine Learning (BQML). BQML is a fully managed service that makes it easier for data scientists to build and train machine learning models in BigQuery using SQL syntax.
The traditional data science workflow
Most organizations have failed to realize the value of predictive analytics because the data science workflow requires a lot of resources, and the largest resource consumption often has little to do with the actual discipline of data science or the creation of machine learning models.
A typical data science workflow can look like this:
- Generate hypothesis & define features -- data scientists define the relevant attributes (features) of the data that they believe can be used to predict future behavior.
- Prepare training dataset -- from their hypothesis, data scientists build a training dataset and move it into a data science environment to feed their model.
- Build model in data science environment -- build a model in R or Python and use the features within the training dataset to predict behavior.
- Validate the model -- compare the model’s predictions against real-world behavior and make adjustments until the model generalizes well enough to run on real-world data.
- Export the model -- once the model is ready for production, the data scientist moves the data back into a dedicated data warehouse to deliver the model’s insights to business users.
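As a rough illustration of the prepare, build, and validate steps above, here is a minimal Python sketch using only the standard library. The data, the function names, and the choice of a one-feature least-squares model are assumptions made for the example, not part of any particular team's workflow.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(model, rows):
    """Average absolute gap between predictions and observed values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in rows) / len(rows)

# Hypothetical training data: (feature, label) pairs on an exact line.
rows = [(float(x), 2.0 * x + 1.0) for x in range(20)]

train, test = train_test_split(rows)           # prepare training dataset
model = fit_line(train)                        # build model
holdout_error = mean_absolute_error(model, test)  # validate on held-out rows
```

In practice, the modeling code is usually the short part. Getting `rows` out of the warehouse and getting predictions back in is where the effort tends to go.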
You might guess that the most important part of the workflow, building and validating a model, takes up most of a data scientist’s time. However, the actual breakdown of time looks quite different.
Frequently the most interesting portion of a data scientist’s job (really their core competency as data scientists)—analyzing and interpreting data—is only a small fraction of their day-to-day responsibilities. Much more of their time is spent munging and cleaning dirty data. In fact, “dirty data” was by far the biggest barrier faced by respondents in Kaggle’s 2017 “State of ML and Data Science” Survey.
This is because data environments within many companies are messy. Data is strewn across various tools and departments, so data scientists spend a vast amount of time simply preparing the dataset for their analysis and moving that data into a place where they can do their work.
Google, one of the leaders in AI and machine learning, is leveraging its BigQuery database solution to help address this problem.
The Google BigQuery ML advantage
With BigQuery Machine Learning, data scientists can now build machine learning (ML) models directly where their data lives, in Google BigQuery. For certain types of predictive models, this eliminates the need to move the data to a separate data science environment.
Data scientists will still want to leverage dedicated data science environments such as RStudio and Jupyter notebooks for more complex analyses. However, for common types of linear and logistic regression models, a data scientist can dramatically reduce the time spent moving and consolidating data by iterating on their machine learning models directly in BigQuery.
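To make "building a model with SQL" concrete, the sketch below assembles a BQML `CREATE MODEL` statement as a string. The dataset, table, and column names are hypothetical, and the helper `build_create_model_sql` is an illustration rather than part of any Google or Looker library. It only produces the SQL text, which would then be run in BigQuery.

```python
def build_create_model_sql(model_name, model_type, label_column,
                           feature_columns, source_table):
    """Return a BQML CREATE MODEL statement as a string.

    Only the two model types mentioned in the text are accepted here.
    """
    if model_type not in ("linear_reg", "logistic_reg"):
        raise ValueError(f"unsupported model_type: {model_type}")
    # The training query selects the features plus the label column.
    select_list = ",\n  ".join(list(feature_columns) + [label_column])
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS(model_type='{model_type}', "
        f"input_label_cols=['{label_column}']) AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM `{source_table}`"
    )

# Hypothetical names, chosen only for the example.
sql = build_create_model_sql(
    model_name="demo_dataset.churn_model",
    model_type="logistic_reg",
    label_column="churned",
    feature_columns=["tenure_months", "monthly_spend"],
    source_table="demo_dataset.customer_features",
)
print(sql)
```

Because the training query is an ordinary `SELECT`, changing the feature set means editing the select list and re-running the statement, with no data export step in between.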
A new workflow with BQML + Looker
Once the model has been built and is ready for testing, a data scientist must ensure that the outputs of the model are piped back into the database and made available to business users. Traditionally, this step might require pushing the data back into a data warehouse or setting up a new data pipeline to bring the data scientist’s work closer to the broader organization.
With Looker on top of BigQuery, this step is eliminated. Because the data never leaves BigQuery, data scientists can immediately deliver the output of their models to business users through the same methods already in use on top of BigQuery.
Now, with BQML + Looker, the workflow for data science looks like this:
- Define features -- data scientists define the relevant attributes (features) of the data that they believe can be used to predict future behavior.
- Build training dataset in Looker -- the data scientist can rely on existing business logic and pre-cleaned data to define features in a LookML model.
- Build model within Google BigQuery -- the data scientist selects any set of those features to iterate on an ML model directly in BigQuery. BQML objects can be defined inside Looker with a cadence for retraining.
- Operationalize via Looker -- predictive objects can instantly be used anywhere in the Looker platform, for operational or analytical use cases.
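The last two steps come down to running two more queries against a model that already lives in BigQuery. This sketch builds an `ML.EVALUATE` query and an `ML.PREDICT` query as strings. The model and table names are hypothetical, and the two helper functions are illustrative only. How Looker wraps such queries is not shown here.

```python
def build_evaluate_sql(model_name, eval_table):
    """Return a query that scores a trained model against labeled rows."""
    return (
        "SELECT *\n"
        f"FROM ML.EVALUATE(MODEL `{model_name}`,\n"
        f"  (SELECT * FROM `{eval_table}`))"
    )

def build_predict_sql(model_name, input_table):
    """Return a query that applies a trained model to new rows."""
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model_name}`,\n"
        f"  (SELECT * FROM `{input_table}`))"
    )

# Hypothetical names, chosen only for the example.
evaluate_sql = build_evaluate_sql(
    "demo_dataset.churn_model", "demo_dataset.customer_features_holdout"
)
predict_sql = build_predict_sql(
    "demo_dataset.churn_model", "demo_dataset.current_customers"
)
print(evaluate_sql)
print(predict_sql)
```

Since the prediction is just the result of a query in BigQuery, it can be consumed the same way as any other table or query result there, which is what removes the separate export step.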
Connecting directly with Google BQML reduces complexity for data scientists by eliminating the need to move the outputs of predictive models back into the database for use. It also shortens the time-to-value for business users, allowing them to operationalize predictive metrics to make better decisions every day.
We believe the future of data lies in amplifying the capabilities of everyone, from data scientists to analysts, to deliver more value and insights to their organizations, and we’re proud to work with Google to make this vision a reality.
Want to learn more about how Looker improves the data science workflow? Visit our data science solutions page and learn about how Stack Overflow uses Looker to increase the efficiency of their data science workflows.
Want to understand how to use Looker with Google Cloud Platform? Visit our Google ecosystem page to learn more about Looker’s integration with Google BigQuery.
Ready to see Looker and BQML in action? Request a demo to see the benefits of Google BigQuery and Looker on your data.