Ask the experts: data architectures for better analytics with Amazon Redshift
Jun 12, 2015
Many companies are choosing to store their data in Redshift. The cost and efficiency of AWS cloud management, coupled with Redshift's performance, have made it a favorite for companies outgrowing transactional databases like MySQL and PostgreSQL for analytics, looking to centralize disparate data sources for richer analytics, or moving from on-premises deployments to the cloud, amongst other use cases.
To discuss the considerations and tactics for deploying Redshift, Looker hosted a panel discussion including data teams from VigLink and Looker, plus Amazon Redshift Senior Product Manager, Tina Adams. Scott Hoover (Data Scientist at Looker) and Barret Tirona (Analytics Platform Manager at VigLink) explain how to setup data architectures that utilize Redshift, including the specific technology and tactical decisions that they made.
Below are the top 5 responses we found particularly informative. You can view the full panel discussion here: https://info.looker.com/events/aws-panel-san-francisco
For additional use cases from data teams that have integrated Redshift, check out How Warby Parker is Creating a Data-Driven Culture.
Top 5 questions:
- Please provide an overview of your data architecture from collection, to ETL, to delivery.
- What was your journey for arriving at your current data architecture? Were there pain points that drove you to your current solution? Also, how does Redshift fit into your broader data strategy?
- We've seen a lot of teams that are using both Hadoop and Redshift in tandem. When have you seen this approach implemented and what use cases might be better for Hadoop versus Redshift?
- Do you have any tips for people who've got MongoDB data that they're trying to move into some type of SQL schema format? Any common architectures you see, any pitfalls you can advise us on?
- Is Looker more efficient when connected to a star schema in Redshift, or is it efficient directly querying using LookML rather than creating a big fact table with all the dimensions in the star schema?
Q: Please provide an overview of your data architecture from collection, to ETL, to delivery.
Scott, Looker: We have a number of different data sources and a lot of human-input data. Starting at the top of the funnel, we have Snowplow—which is an open-source event collector—to capture Looker website traffic. Marketing uses Marketo to create and track campaigns. Our Sales, Marketing, and Customer Success teams use Salesforce quite heavily. We have an in-house event collector which tracks Looker usage. Product and Engineering monitor Github commits and issues. Our Customer Support team uses Zendesk. The Looker community uses Discourse as a QA forum. And our DevOps team uses Zabbix.
Salesforce is really at the center of our data universe. We use an ETL tool called CopyStorm to move our Salesforce data into a MySQL database. Both Marketo and Zendesk have native connectors to Salesforce, so they're brought in by CopysStorm as well. Snowplow uses Elastic Mapreduce jobs to transform the data and then pipes the structured data into Redshift. We have custom scripts that hit the Github and Discourse APIs and pipe data into PostgreSQL databases.
I came into this role when we were actually playing with the idea of migrating to a proper data warehouse. So I’m in the process of taking all of these disparate data sources and migrating them into Redshift using a tool called DataVirtuality.
Barret, VigLink: We have millions of clicks a day coming through our platform. We have Fluentd logs. That's where we start. We have Fluentd logs sitting in JSON formatted on S3. We use Data Pipeline to manage all of this.
We take the Fluentd logs from S3, we parse them, get them ready to be brought into Redshift, and then load them into the actual Redshift stage schema. From there, we have another step that Data Pipeline manages to actually run scripts that do the aggregation against that data and bring it into the Redshift production schema that we have.
As an aside, we also have MySQL data we're bringing in. Basically, we have RAL logs in MySQL. That's for the revenue stuff. We have our clicks that occur, and then revenue can occur at any point in the maybe 30 to 60 days after that. We do have a separate Data Pipeline job to bring the revenue data into Redshift as well.
In addition to Fluentd logs, there are impressions. There are fewer attributes in our impression data, but we do bring that in. Basically, our measures are clicks, impressions, and revenue.
Q: What was your journey for arriving at your current data architecture. Were there pain points that drove you to your current solution? Also, how does Redshift fit into your broader data strategy?
Barret, VigLink: I've been at VigLink now for a little over two years. When I started, our method of exposing data to business users was generating SSH keys for every single business user and granting read‑only access to the slave MySQL database for them to run their own queries.
Scott, Looker: When I first started at Looker, we relied heavily on our event collector provided. Fast‑forward a year and a half later, it has begun to seriously tax this MySQL database. We've gotten to a point where we can't really do analytics on this dataset, which is why we've decided to migrate to Snowplow and Redshift.
Looker has always paired well with Redshift. When we decided to phase out our in-house event collector and go with Snowplow, it was sort of a no brainer.
Unlike VigLink, our data is not "big data." It's marked by high dimensionality rather than number of rows. What's really important for every team is getting a complete view of the business: from the very top of the funnel right on down to how customers are interacting with Looker.
Marrying all of these disparate data sources—Salesforce, PostgreSQL, MySQL, and Snowplow—into a single data warehouse has proven to be an interesting challenge.
I'm actually using this tool called DataVirtuality, which I mentioned earlier. I think of it as analogous to Looker, but for ETL. It dramatically simplifies the way we can marry disparate data sources.
You can basically use any arbitrary relational data store as your warehouse, and you can connect this tool to any relational database. It also has connectors to various APIs. So, basically, one can join across random databases, across a database and an API, and then pipe the result sets into a data warehouse.
Q: We've seen a lot of teams that are using both Hadoop and Redshift in tandem. When have you seen this approach implemented and what use cases might be better for Hadoop versus Redshift?
Tina, Redshift: Definitely. We see a lot of customers use Hadoop in conjunction with Redshift, actually.
Oftentimes, what happens is customers will use Hadoop for transformations, for ETL. Then, when they want to do structured queries using SQL relational, they choose Redshift.
We see Hadoop as ETL into Redshift as a very, very common pattern. We also generally see people using Hadoop when they want to do something that isn't meant for a structured data warehouse, so things like machine learning, data modeling, things like that, and, of course, transformations.
That being said, we have customers who do ETL. They load it into Redshift, and then they transform in Redshift, because we actually are great at doing really fast aggregations, for example.
The biggest difference between Hadoop and Redshift is, really, do you want something that's structured SQL, or do you have analysts who can run Pig and MapReduce jobs and want to do something a little bit more complex and flexible? I think that's about it.
Scott, Looker: I agree with Tina. Relational databases simply aren't suited for unstructured data. Take, for instance, Natural Language Processing against raw text documents: you're not going to be able to read this type of data into Redshift. This holds for a lot of machine learning problems that people tackle.
Q: Do you have any tips for people who've got MongoDB data that they're trying to move into some type of SQL schema format? Any common architectures you see, any pitfalls you can advise us on?
Scott, Looker: I think your strategy ought to be to just dump your unstractured data into an Elastic MapReduce cluster, maybe write a Pig job that sort of flattens the data, and then copy it into Redshift. We also have a partnership with a company called Alooma; their tool specializes in MongoDB to Redshift ETL.
Tina, Redshift: On the more general question of schema migration, we have many customers who migrate from traditional, on‑premise systems into Redshift, and, often times, the big challenges are the devil is in the detail so, "I'm used to using this particular system, and I have these particular schemas for these particular queries, and how do I port that over to Redshift without it being this huge nightmare for me?"
One high‑level point is you really should start with the schema that you have and try it with Redshift first and see how that goes.
Another thing to realize is we are a columnar data store, so if you're using a row‑based system, for example, they're quite different. The way that this system is architected is completely different, and so you're going to have to, unfortunately, do work to understand our best practices around things like distribution keys and sort keys, and how do you actually write queries.
We have a database engineering team that is focused on helping customers, so if you have a particular problem, you're using Redshift and you're trying to figure out how to best tune your queries. In addition to our documentation, we can take a look at your cluster, and our database experts, who are very familiar with all the different systems, on‑premise systems, RDS, NoSQL, can help you see what you could be doing better with your cluster, or what you could be doing better with your queries. Definitely, reach out to your rep and reach out to my team to get that kind of feedback.
In general, the most important thing is to really read‑up on our best practices and our documentation, because that will explain all the tuning that you need to do to make Redshift as pro forma as possible.
Things to watch out for, if you're coming from a row‑based system, are commits. We see that it's a huge issue. Redshift is designed for batch processing. Transactions are serial in nature, so if you are used to doing a bunch of commits, often you'll move to Redshift and say, "Well, I thought this was going to be super fast."
It is super fast if you're batching your transactions as much as you can, but if you have a bunch of concurring commits, all of a sudden, they're just going to be back‑logged. That's one thing I see come up over and over again, and so you just have to re‑architect around that, which my team is always happy to help you, or our system integrators are also very good at that.
Q: Is Looker more efficient when connected to a star schema in Redshift, or is it efficient directly querying using LookML rather than creating a big fact table with all the dimensions in the star schema?
Tina, Redshift: We see customers successful with star and snowflake schemas in Redshift. We see them successful having a denormalized schema. My recommendation is start with a schema that you have; try it out. Then, in your POC, you can evolve it over time and see what works better.
I think there's some simple trade‑offs here. With a star schema, the nice thing is that managing tables, things like vacuum and analyze, because your tables are smaller, are easier. But then you have to do joins.
You have to be very careful about tuning to have the right distribution keys so you get manageable joins and don't have to go across the network, because we are a cluster data warehouse, for those who are not familiar with Redshift.
If you have a big, wide table with everything, the nice thing is you don't have to worry about the joins. But then you want to vacuum that table. It might take a while. You have to maintain, sort, analyze, and all that. I think it's a trade‑off. But we've seen both used very successfully, so I would try what you have first and then go from there.