There's an ongoing saga in the realm of database languages around SQL. With all of the buzzwords going around with SQL, noSQL, and Hadoop it can be helpful to take a look back in history and see how we got here. Join us as we look back in time to see the long view of database analytics and where we think the industry is heading.
Somewhere around 40 years ago, two computer languages were born, and they're still running the world's computers. At Bell Laboratories, Dennis Ritchie was designing The 'C' programming language) and at IBM, Raymond Boyce and Donald Chamberlin were developing SQL.
The C Language created a common way to talk to machines at the instructional level. SQL created a common way to create, write to, and transform data sets.
Just about every computer is still running C code. Just about every business is still using SQL to deal with its data.
They created the right level of abstraction. They separated the intent from the implementation specific detail. Before C, all code had to be coded against a particular processor or written in a special-purpose language (like COBOL). Before SQL, there was no consistent way of transacting and querying data. Programs' transaction and querying data were hand-crafted and inconsistent.
Languages evolve and fork. C has spawned many new language, like C++, Java, Objective C, and C#. At the core, these languages are designed to be efficient at the machine level, strongly typed and give a large amount of control over the physical device. They've introduced some conveniences and efficiencies (like object orientation, strings, and in some cases, garbage collection), but generally the reason to use these languages is tight physical control over the device.
SQL is all about enabling applications to store, access, and query their data in a consistent, shareable way. Once an organization has their data stored in a SQL database, different parts of the organization can have access to data from different applications. For example, the CRM (customer relationship management) system can see data from the order and fulfillment systems. Before SQL, data from each of these applications would be only available to that single application.
SQL has evolved strangely. While just about every business runs SQL, SQL from different vendors is not the same. In the 80s and 90s, a bunch of companies and people built a bunch of different SQL servers: Oracle, IBM, Informix, Microsoft, InterBase, and some open source MySQL and Postgres.
With C, there is great incentive for code to run the same on different processors. With SQL, the incentive is the opposite. Vendors selling SQL software and machines want you to stick with their products, so they create subtle differences in the language to increase the costs of switching. SQL vendors knew they had to be similar enough that applications could support multiple vendor implementations, but different enough that once you bought one, you weren't likely to change.
Somewhere along the line, datasets started getting too big to handle. SQL is both a transactional language and an analytical language. Transactional data is written into transactional databases, data gets moved (Extracted, Transformed, and Loaded, aka ETLed) into data warehouses.
Data got big enough that strategies were created for dealing with large datasets. The most common is OLAP, a strategy for precomputing answers into intermediate tables so queries can finish in a reasonable amount of time.
In the late 90s and early 2000s, most startups stopped using commercial databases in favor of the open source derivatives. MySQL was the most popular with the larger companies. MySQL was used at Google, Facebook, and Twitter as the underlying data store. Their datasets grew large enough that data didn't fit into a single machine. They began using techniques like sharding to distribute the data across multiple machines.
Unfortunately, if you shard your data (store different parts of the same logical data on different servers), it becomes very difficult to run SQL queries. Once the data is sharded across databases, SQL for analytics stops working (without some effort to bring it all back together). With even larger data sets, like Google's query logs, data was spread out among a bazillion instances.
Google introduced MapReduce to the world. MapReduce takes large amounts of data (mostly log data) spread out across a network and reduces it into a more usable form (also distributed).
MapReduce made it possible to digest huge amounts of data in a distributed way, but it suffers from all the same problems that led to the rise of SQL in the first place. MapReduce code is hand-crafted, not reusable, and expensive to write and deploy.
Internet programming is often about getting things done fast. The most common databases used for Internet transactional SQL are MySQL and Postgres. Languages like Python, PHP, Ruby, Perl and Java often have interface layers that sit between the raw SQL and the language. These interface layers are called ORMs.
ORMs have a benefit in that they abstract the underlying SQL (remember all SQLs are subtly different). This makes the code portable between database vendors, somewhat faster to write, and more uniform. If you have hopes of getting big, some of the ORMs have built-in sharding capabilities.
An unfortunate effect of ORMs is that many new programmers don't actually learn SQL and don't know how to ask and answer questions using SQL.
If you are planning on sharding across multiple mySQL servers because you think it will be so big (and you can't run a query across the multiple servers once you shard), why bother using a transactional database at all?
This thought lead to the rise of mongoDB and other object stores. The ORM for mongo is much more natural. Just persist an object into the database and read it out. Works great, right? Theoretically, it scales. And if configured right, it probably does.
The only problem is that actually running any kind of query takes a programmer.
At the same time NoSQL was cranking up, another trend was starting. Analytical SQL was making a major comeback. Google's noise about MapReduce signaled a real opportunity and a bunch of companies were founded in the space.
The idea was, that if you had a SQL engine that didn't need to do transactions and was built in a distributed architecture, how fast could you make it go? The answer: pretty damn fast.
Greenplum, Vertica, ParAccel and probably some other vendors designed software and built dedicated clusters that you could upload vast amounts of data and run SQL queries that ran very fast. How fast data could be loaded and queried became the benchmark.
These new SQL engines are amazing. They can easily store an entire company's data history in its raw form and can do historical research in seconds.
Hadoop showed up as an open source version of MapReduce (this is an oversimplification). There are bunch of companies building tools around Hadoop, but Hadoop doesn't have a SQL interface. And then it does—Hive software lets you query, albeit weakly, in SQL.
Splunk, a data analysis system also implements a specialized subset SQL for analytical querying.
Google introduced BigQuery, its BigTable implementation and its own subset of SQL for analytical querying.
The reason SQL matters today is the same reason it always mattered, to give a common way to ask and answer questions. The easier it is to conjure up a question, the faster you can get that answer, the more informed business decisions you can make.
Some very successful companies in recent times have simply been better at seeing and understanding their data. Google, of course, is the obvious one. Facebook might be less obvious, but their focus on measuring things made them successful. Walmart is often thought of as a technology company. Much has been written about Target's targeting customers.
If you can't see, you can't fly.
SQL is not without its problems. While SQL is tremendously useful, it has that same flavor of programming in assembly language or Perl (write-only code, you can't read it). You understand what you are writing when you write it, but if you come back a week later, you have to rethink the whole thing again to understand it. Worse, if you look at someone else's SQL code, there is a good chance you might not understand it.
SQL code is passed around the same way you might pass around an incantation. A coworker might email you a query "Here, run this thing and it will give you your answer." You might modify the query by changing the date or dropping or adding a column, but leave the general structure of the query alone.
Over time, you have multiple copies of the original copied query, and you are not sure which one you should be using.
If you need a new query, you go to a SQL guru, and she writes if for you. You have a new query to start forking.
We've worked really hard trying to simplify SQL's problems and we think we've done it.
The LookML language is a functional model-based approach to building analytical SQL queries. LookML abstracts your dataset into models, views, and fields. In LookML, you describe the relationships between the views, how fields are computed (directly from the underlying table, computed from something in the table, either in the simple or in the aggregate). Field computations can reference other fields.
The upshot of this is that all queries, no matter how complex, can be represented in five things: a model, a view, a list of fields to compute, a set of filters (on the fields), and a sort order.
It turns out that when you simplify the query model, just about anyone can use it. It also turns out that when you create a framework where you only need to understand the field you're building at the moment, you can build models that in turn can write queries that you would never have the mental stamina to write by hand (but execute beautifully in SQL).
We're proud of LookML, we think it's a beautiful thing. We hope you will too.