The Redshift - SORTKEY and DISTKEY
Dec 4, 2013
Indexing in MPP analytical databases (Redshift, Vertica, Greenplum...) works very differently than in traditional transactional databases.
Suppose I’m looking to count nursing jobs. I’ve got a list of all job postings in Santa Cruz, and they aren’t ordered in any reasonable way. So I start at the first one, read it, and scan down the list looking for the word “nurse.” It takes a couple of seconds—no problem.
Suppose I go regional and I’m looking at the San Francisco Bay Area. That’s 20 times the number of Santa Cruz job postings, and it takes me a few minutes to scan the list and make a count. If I wanted to look at all of California, maybe it would take me an hour. The entire United States—even longer.
Now, suppose I want to analyze and compare all of these job postings across the entire United States. I need to scan these lists many times quickly. It’s too hard for me to do it alone, so I get the great idea to enlist a bunch of friends who can help me. To make things efficient, I distribute smaller pieces of the list for each city to each of my friends. They scan their smaller lists in a few seconds and send me their counts. I combine all these counts for the answers. Teamwork saves the day!
MPP (massively parallel processing) databases are like teams of robot friends.
In the old world of databases, you had just one reader—like me, before I had my friends. If I wanted to look at 10,000 things instead of 1,000 things, it would take 10x as long. One machine, one reader.
In the new world of MPP databases, such as Amazon Redshift and HP Vertica, there are many readers that work in concert to scan a single list. The speed at which they can scan is only limited by the number of readers and the speed at which the results can be combined at the end.
Sorting and distributing: In the world you’re scanning, there are just two tools.
Sorting is pretty simple. Suppose a job list contains all the jobs in the last year. Next to each job is a date and the jobs are in date order. Really, I’m only interested in the last month’s postings. It’s pretty obvious, but I can avoid the work of reading all of them by jumping to the first posting of the month. In the MPP database world, this is a SORTKEY. Each list (or table) in the MPP world has exactly one SORTKEY.
How each part of the list is distributed to my friends actually matters. If my friends were truly random people in California, it might work to send them the list by those closest to the job. (Random people and random jobs will likely be distributed evenly in population centers, making it a decent algorithm.) But my friends tend to be localized—closer to where I am. It would be bad if I had one friend in LA and the rest in SF: My friend in LA would be really overloaded. For this problem, some randomly balanced way of distributing the list is appropriate. The location of my friends is probably not a great DISTKEY.
DISTKEYs and combining with other data.
Suppose I had another list, which was all the employers in California. This list also has some data about the employers, one item of which is the number of employees. Assume this is also a really big list. I want to know the number of nursing job listings from employers with more than 100 employees. In order to figure this out, I’ll have to scan the job postings (like we did before). And when we find a nursing job, we’ll have to look up the employer in the employer list to see if they have more than 100 employees. If they do, we’ll include the job in our output list.
The problem is: Each of my friends will have to have a copy of the entire employer list to complete the task. Since this list is really big, distributing the list to all my friends isn’t feasible.
If I’m clever, I can break up both the jobs list and the employer lists the same way, by ZIP code. I can give lists from multiple ZIP codes to each of my friends and try to do a reasonable job of balancing the work. When I’m looking at a job listing, I don’t need to look at the entire California employer list, just the list in the right ZIP code.
A tale of two keys.
MPP databases are a major advance, though the underlying mechanism is very simple. All data is stored in tables sorted by SORTKEY. You can scan the entire table by filtering based on the SORTKEY.
Lists are partitioned by DISTKEYs. If you’re going to have two large lists that are linked together in some way (a customer and her orders, for example), you’re going to want to distribute portions of that list based on a key, a DISTKEY, that keeps this data together logically.
MPP machines. The robot army.
Once data has been organized in this way, it becomes easy to engage a robot army to solve your problems. The speed at which you can look at data is only limited by the size of your army (the number of machines in your cluster). Want to look at 20B pieces of data in a few seconds? No problem.
In this world, we really do think about problems in a different way. Looker is built to harness the robot army and make it work for you. We see the world differently. So could you.