Business intelligence and the elephant's chain
Aug 1, 2013
Back in the bad old days of traveling circuses, elephant trainers would shackle baby elephants to stakes using large chains. At first, the elephant would pull and pull, but would eventually give up. Over time, the trainers were able to reduce the chain to a rope, and eventually to an even smaller rope. The elephants, believing they couldn't escape, never tested if they could break the bonds.
In the bad old world of Data Warehousing, machines lacked power. Data sets were way too big for the available machines. If you had hundreds of millions of transactions and you wanted to query them all, the query could take days.
You don't want to wait days, right?
To solve the problem, some very smart people came up with the idea of dimensionalizing the data or rolling it into aggregated forms like "data engines" or OLAP cubes.
The results were amazing.
These intermediate caches pre-compute sums, so when you look at lots of transactions over a long period of time, you don't have to read all the data. Building cubes is hard work, because you want to summarize the data all the different ways you might want to look at it.
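To make the idea concrete, here is a minimal sketch of a pre-computed rollup, using Python's built-in sqlite3 module as a stand-in for a warehouse. The table names (`transactions`, `sales_rollup`), columns, and sample rows are all made up for illustration; a real cube would roll up along many more dimensions.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory database with raw rows and a rollup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, region TEXT, month TEXT, amount REAL)"
    )
    rows = [  # hypothetical sample data
        ("east", "2013-06", 10.0),
        ("east", "2013-06", 15.0),
        ("east", "2013-07", 20.0),
        ("west", "2013-06", 5.0),
        ("west", "2013-07", 30.0),
    ]
    conn.executemany(
        "INSERT INTO transactions (region, month, amount) VALUES (?, ?, ?)",
        rows,
    )
    # The "cube": sums pre-computed once, per region and month.
    conn.execute(
        "CREATE TABLE sales_rollup AS "
        "SELECT region, month, SUM(amount) AS total, COUNT(*) AS n "
        "FROM transactions GROUP BY region, month"
    )
    return conn


def total_from_rollup(conn, region):
    """Answer the question from the pre-computed sums (few rows to read)."""
    sql = "SELECT SUM(total) FROM sales_rollup WHERE region = ?"
    return conn.execute(sql, (region,)).fetchone()[0]


def total_from_raw(conn, region):
    """Answer the same question by scanning every raw transaction."""
    sql = "SELECT SUM(amount) FROM transactions WHERE region = ?"
    return conn.execute(sql, (region,)).fetchone()[0]
```

Both functions return the same total, but the rollup reads one row per region and month instead of one row per transaction. The catch is that the rollup can only answer questions at the grain it was built for.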
But this approach is nearly impossible for the layperson to understand, so the only people who could work with cubes were the ones in the analyst group. Another restriction was that you had to massage your data from its normal form into a form that could be placed into these intermediate stores.
This made your data function like low-resolution photographs. You get a general idea of the picture, but when you zoom in, it’s indecipherable pixels.
How to best design these beasts became a black art. Building cubes, for example, was a huge project that took teams of people and a very deep, expensive stack (to move and transform data, roll it up, query it, and report on it).
Enter MPP, In Memory, and essentially big f*ing machines.
Now imagine you have machines that are 1000x faster. Really, that much faster. Modern analytical databases can query 1000x or more the data that the databases of yore (when OLAP was created) could. Amazon's Redshift, for example, in a large cluster, can query a billion rows in a few seconds.
So what? How has the world changed?
The point at which you need to optimize is pushed way out. Way, way out.
If you have a few billion transactions, no problem. Throw them into a bigger machine and you can query them in their raw, unsummarized form. Less than that and life is really easy -- a large MySQL instance should do the trick.
Just let me write the SQL.
But an interesting thing happened -- people are back to writing straight SQL. This is because 95% of the old tool chain is designed to help people transform data. That's not the problem anymore. Since you can easily query the data in its current form against the transactional schema, just write the damn SQL.
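As a rough illustration of what "just write the SQL" looks like, here is a small sketch that queries a transactional schema directly, with the join, filter, and aggregation all happening at query time. The `customers` and `orders` tables and their contents are hypothetical, and sqlite3 stands in for whatever analytic database you actually run.

```python
import sqlite3


def make_db():
    """Build a tiny, made-up transactional schema in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            placed_on TEXT,
            amount REAL
        );
        INSERT INTO customers VALUES
            (1, 'Ann', 'CA'), (2, 'Bob', 'NY'), (3, 'Cy', 'CA');
        INSERT INTO orders VALUES
            (1, 1, '2013-07-01', 40.0),
            (2, 1, '2013-07-15', 60.0),
            (3, 2, '2013-07-20', 25.0),
            (4, 3, '2013-06-30', 80.0);
        """
    )
    return conn


def july_revenue_by_state(conn):
    """Aggregate raw orders at query time; no intermediate cube involved."""
    sql = """
        SELECT c.state, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        WHERE o.placed_on >= '2013-07-01' AND o.placed_on < '2013-08-01'
        GROUP BY c.state
        ORDER BY c.state
    """
    return conn.execute(sql).fetchall()
```

If tomorrow's question is about cities instead of states, or weeks instead of months, you change the query, not the pipeline.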
The database vendors were proud of their creations, and told customers, "Just put the data in, you can transform at query time." But the tools to do this never came, so people coded in SQL around it.
Let my people go.
More people are getting access to analytic databases. It’s easy to spin up copies of analytic databases, so more and more folks are gaining the ability to query.
The machines are great at ingesting and transforming the data on the fly, but getting the data out in usable form has been an issue. The data marathon was almost over, but nobody took these implementations "the last mile."
So what's next?
There needs to be a whole new approach to BI. And this is where Looker comes in.
The traditional BI tools were built a long time ago, and had all the handcuffs described above to deal with. They focused on new ways to aggregate data, or ways to do smaller, departmental solutions that could be deployed more quickly.
Looker took a different approach. We architected with the new ecosystem in mind -- a modern replacement for an old stack. We build highly reusable models right on top of transactional databases and/or analytic mirrors. We summarize in a much more intelligent way (entity facts instead of time-based facts), and we built for collaboration from the ground up.
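One plausible reading of "entity facts instead of time-based facts" is summarizing per entity (say, per customer) rather than per time bucket (say, per day). The sketch below shows that contrast under that assumption; it is not Looker's actual implementation, and the `orders` table and its rows are invented for the example.

```python
import sqlite3


def make_orders_db():
    """Create a small, hypothetical orders table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            placed_on TEXT,
            amount REAL
        );
        INSERT INTO orders VALUES
            (1, 1, '2013-07-01', 40.0),
            (2, 1, '2013-07-15', 60.0),
            (3, 2, '2013-07-15', 25.0);
        """
    )
    return conn


def daily_facts(conn):
    """Time-based summary: one row per day."""
    sql = """
        SELECT placed_on, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM orders
        GROUP BY placed_on
        ORDER BY placed_on
    """
    return conn.execute(sql).fetchall()


def customer_facts(conn):
    """Entity-based summary: one row per customer."""
    sql = """
        SELECT customer_id,
               MIN(placed_on) AS first_order,
               COUNT(*) AS lifetime_orders,
               SUM(amount) AS lifetime_revenue
        FROM orders
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return conn.execute(sql).fetchall()
```

The daily summary tells you how much was sold each day; the per-customer summary tells you who your repeat buyers are and when they first showed up, and it can be joined straight back to the raw orders.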
We did this because BI didn't need to be dumbed down. There's a new class of data analyst who grew up with Google, not Windows. With this, we leverage people who know SQL, and they enable ad-hoc exploration for those who don't.
The result is not just high resolution, but infinite resolution. Looker enables users to look at the entire aggregate, zoom down, and explore the lowest level. We call this approach "Data Discovery 2.0," and we are helping some pretty creative companies get more out of their data -- and break the elephant's chain.