What is Presto Hadoop?
Presto is an open-source distributed SQL query engine that can be placed on top of a wide variety of data sources, from Hadoop distributed file system (HDFS) to traditional relational databases as well as NoSQL data sources such as Cassandra. Presto is particularly equipped to perform multiple concurrent interactive queries against a data source.
Presto was developed at Facebook in Fall 2012 as a replacement to Hive, and Facebook still uses the software to query their 300 petabyte Hadoop cluster filled with data generated from their 1 billion-plus users. Other prominent technology companies such as Netflix, Dropbox, and Airbnb. There is therefore a strong developer community behind the project.
Teradata has a dedicated team of over 20 developers contributing open source code to Presto. The company supports a multi-year commitment to increasing Presto adoption in the enterprise. However, the documentation is limited and is best suited for developers, which means you will need engineering resources to setup and maintain your Presto Configuration.
Why choose Presto?
The developers that built Presto focused on providing interactive querying to their users, which meant decreasing the amount of time it took for a query to return. Therefore, Presto is a great choice as the backend for business intelligence tools such as Looker because the engine enables queries to return very quickly.
Presto has many integrations with 3rd-party data sources, and is able to connect to Kafka, Redis, H3, S3, Hadoop, NoSQL, MySQL, Postgres, among others.
Presto uses standard ANSI SQL to query different data sources, while other SQL Engines such as Hive do not fall under the ANSI standard (Hive uses a querying language similar to SQL known as HiveQL).
What is Presto great for?
Interactive ad hoc querying large datasets
On denormalized data sets that require few joins, Presto approaches interactive query speeds, which means that queries with Presto return on the order of seconds to minutes, rather than minutes to hours. This is largely due to Presto’s unique architecture which is performed completely in-memory (rather than other SQL Engines that use MapReduce and write intermediate steps to disk).
Organizations with disparate data sources
For companies that cannot consolidate all of their data into one centralized store, Presto’s data federation capabilities can create a unified query layer that enables you to blend your data across different data sources together.
Organizations with star schemas
Due to Presto’s in-memory architecture, extremely large queries across fact tables can tend to overwhelm the compute engine. This makes Presto a great choice for inexpensive joins within a star schema between a fact table and a dimension table.
Presto is open source so the software is free for download. However, it’s important to note that the total cost of ownership for Presto will be high, since engineering resources are necessary to deploy and maintain the software, as well as manage the hardware if deployed on premise. Managing and deploying the servers will have a variable cost.
Presto’s architecture is more similar to traditional analytical MPP database architectures than other SQL Engines such as Hive, given that all of Presto’s computations are performed in memory and don’t use MapReduce to compute data. This means, instead of writing intermediate operations to disk, like MapReduce does, Presto executes compute steps completely in memory and pushes the intermediate results across the network. This architecture dramatically increases query-speed, while limiting query throughput to a degree.
Executing a query in Presto
When a SQL query is issued in Presto, the query is sent to the Coordinator. The coordinator is responsible for breaking the query into logical steps and forming an optimized query execution plan. One the coordinator forms an execution plan, it then distributes instructions to the worker nodes closest to where the physical data is located, which then stream data from physical storage into memory, where it’s then pipelined across the network, and returned to the client.
Types of data
Presto is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Presto itself does not store data.
Range of row (quantity)
Presto is a petabyte-scale SQL engine, depending on the data stores it is connected to. For instance, if you are using a Hadoop cluster, the amount of data that you will be able to store is much larger than if you are connecting to a postgres database. On the other hand, because Presto writes ANSI SQL, you can leverage many data sources at once, making the data volume Presto can handle very large.
Implementing and using Presto
Presto can be downloaded for free from Github or from Teradata.
Once downloaded, Presto has a guide to deploying the software on your server.
Process for new data
Given that Presto is a SQL Engine and not a database, Presto doesn’t actually store any data, which means the process for adding new data just means making Presto aware of an existing data source that you have. This means either connecting Presto to this existing data source, or piping the data into a data source that Presto is able to query.