What’s Hadoop and how does SQL fit in?
Hadoop is an open-source framework that came on the scene in the early 2000s for storing and processing really big data. It uses parallel processing on commodity hardware to distribute computation across huge clusters of machines.
Hadoop jobs are usually written in Java. But because this can be complicated and analysts largely still prefer to write SQL, several computation engines have been developed that allow you to write SQL, and have it run on a Hadoop cluster.
Hadoop’s core strength lies in its modularity. Traditional relational database management systems (RDBMSs) and MPP data warehouses bundle their file storage, compute engines, and data management systems together in one platform. With Hadoop, all of these components operate separately and independently of each other. Hadoop as a software category is massive, and the entire ecosystem consists of dozens of software vendors, support/service providers, and applications.
Unlike many analytical data warehouse offerings, Hadoop’s software is designed to run on commodity hardware, which means the entire software architecture also operates separately and independently of the hardware.
These two features, taken together, mean that a skilled DBA or DevOps team can take steps to customize every part of their data stack with Hadoop. This means that if you possess a Hadoop cluster, it’s going to be important to know the different benefits and drawbacks between vendors.
What are advantages of using SQL on Hadoop?
Cost effectiveness and flexibility
One of Hadoop’s major advantages is that the software is open source and designed to run on commodity hardware, or non-proprietary hardware that is not tied to any one software provider.
This feature is significant because it allows organizations to add more machines to their Hadoop cluster at a predictably low cost. Organizations also have the ability to swap out different Hadoop technologies to optimize their processes at any given moment without needing to replace their whole hardware stack.
For example, organizations interested in provisioning their Hadoop cluster as a data warehouse can add a SQL engine such as Hive to query their data via SQL. They can then swap Hive out for another SQL engine such as Presto or Spark, without the need to change their hardware. This flexibility is much greater than with other on-premise data warehouse vendors because of the interoperability of Hadoop-based systems.
Hadoop users don’t need to declare a schema when they store data. This makes the Hadoop Distributed File System (HDFS) a cost effective way for organizations to store massive amounts of unstructured data, and declare a schema only when they want to analyze the data. This eliminates the need to declare a potentially limiting schema up front when writing data to the cluster.
Hadoop’s infrastructure extends far beyond SQL, and is compatible with many different programming languages. It’s preferred by many developers interested in performing machine-learning analysis and running multiple programming languages in parallel on the same cluster.
Prior to the development of SQL Engines, writing code for execution on Hadoop required specialized knowledge. However, the advent of SQL engines on Hadoop allows organizations to run ANSI-compliant SQL against their Hadoop clusters. The introduction of these technologies allows business users to connect any SQL-based reporting tool to their Hadoop cluster.
Reliability and performance
Hadoop, and specifically the Hadoop Distributed File System (HDFS), is commonly deployed within enterprise data stacks for two reasons: HDFS is designed to be remarkably fault-tolerant (data within the cluster will still be accessible despite machines in the cluster failing) and Hadoop’s MapReduce framework allows developers to process extremely large amounts of data.
Popular SQL on Hadoop engines
“When they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but rather bigger systems of computers.” - Grace Hopper
Hadoop’s creators pioneered several concepts that have driven innovation in the big data space since.
Hadoop was designed to leverage linear horizontal scale, which means adding more machines of the same size to your database system, rather than making each individual node larger.
This principle not only ensures that Hadoop users can easily and effectively scale their Hadoop cluster, but also that the Hadoop cluster will remain performant even and especially as it scales out in size.
No data movement
Hadoop was initially created as an unstructured data lake where raw data could sit before being piped into a RDBMS. This was a boon for developers who could write Java, because they could access the unstructured data and perform analytics on it directly, without having to pipe the data into a relational database. Business users, however, had to wait for the data to be cleaned and moved into a relational database where they could access it with SQL before using it.
With the advent of powerful SQL engines for Hadoop, analysts and business users can access data from the cluster directly, rather than waiting for the data to be moved. Analysis here can be done closer to real-time, and this limits the need for a dedicated data warehouse for analysts, which can simplify a company’s data architecture.
In older systems, before data could be written into the database, you had to declare a schema that specified how the data in the database would be structured. Creating a schema was a crucial part of the Extract, Transform, Load (ETL) processes that were required to properly write data to a database.
This process was known as schema-on-write, and it was the only method of writing data to a database. Some of the benefits of a schema-on-write framework were that the data in the database was robust (everything that was written was determined to be useful to analysts), consistent (all of the data was structured the same), and clean (accessible to end users).
However, schema-on-write brings with it several significant drawbacks. The person managing the ETL needs to know how the data is going to be used and analyzed before it’s written, which usually results in more limited dataset availability. Data may be thrown out or not collected because there isn’t a clear enough use case for it yet, which prevents historical analysis.
Hadoop and other modern data lakes turn this idea on its head. Instead of declaring how exactly the data should show up in the database, data from external sources is written to a data lake, and transformed only when it’s ready to be read. This new framework is called schema-on-read, because a schema is defined by analysts as the data is read out of the database for analysis.
While the schema-on-read framework requires more upfront work by analysts to understand the data before declaring a schema for analysis, it can offer more versatility and flexibility for analysis, and ensures that no data is thrown out because it can’t be immediately schematized. Schema-on-read also gives analysts the freedom to impose the schema that makes the most sense for the type of analysis they’re doing. They don’t need to fit the analysis to the schema, they can instead tailor the schema to the analysis.
Three major components of Hadoop architecture
The Hadoop architecture can get very complex very quickly, but there are three core components of every modern Hadoop deployment:
- Hadoop Distributed File System (Storage)
- MapReduce (Processing)
- Apache Hadoop YARN (Resource Allocation)
Hadoop Distributed File System (HDFS)
One of the biggest challenges that led to the introduction of HDFS was the need to store vast amounts of a data in a way such that performance wouldn’t degrade as the storage scaled.
Loading data into HDFS
HDFS was created as a solution to distribute individual pieces of data across many individual machines, and index them so they can be retrieved and processed quickly.
Raw data comprised of varying data formats is loaded in HDFS and partitioned into bite-sized files known as blocks. These blocks are then uniformly distributed across the cluster onto individual storage machines called data nodes. The locations of each individual block are then assigned to a specialized node called a NameNode. The NameNode both indexes the locations of blocks across the cluster and manages access to these data nodes.
Replication by a factor of three
HDFS replicates each block three times, and stores copies of each block on separate data nodes. This replication pattern makes the HDFS remarkably fault tolerant, because if a machine fails (which is almost guaranteed to happen at scale), data across the cluster will still be available. This replication pattern guarantees that when a copy is lost, HDFS automatically replicates that copy somewhere else in the cluster, ensuring consistently high availability of the data.
MapReduce is the framework that processes data distributed across many different nodes in HDFS. MapReduce, as its name suggests, is a combination of two phases:
- the map phase, which performs the operation on the individual block files stored in HDFS
- the reduce phase, which combines the results
Pre-MapReduce data preprepation
Before MapReduce can run, the data needed for the MapReduce phases needs to be located and prepared for processing. When a MapReduce task is executed, the data stored across HDFS in blocks is located and the format is determined. This allows MapReduce to process a large variety of file formats stored in HDFS.
Because data from a single file is broken up among blocks, another function called InputSplit is called which splits the data into smaller, defined chunks to be process. Input splits don’t contain data themselves, rather, they use what’s known as a byte offset, which allows MapReduce to determine a logical pattern for breaking the files up.
After the inputs are located and readied, the data is ready for the map phase. During the map phase, the inputs are processed on individual nodes in the Hadoop cluster all at the same time — this is known as a map. All of these maps are then shuffled, which consists of sorting and placing these maps on the correct nodes, for the reduce phase.
Once the data is properly partitioned on the correct nodes, the reduce phase aggregates all of the individual results of a map phase into a single output. It’s essentially a summary operation that combines all of the results of this data into one answer.
Example of a MapReduce operation
Let’s say for example, that this article on Hadoop was read into HDFS as a file, and you wanted to run a MapReduce task to figure out how many times the word “Hadoop” was mentioned. First the file would be read into HDFS, and broken up as individual block files. These blocks would then be partitioned into individual input splits, which ensures that the words in this article are not being cut off (if one instance of Hadoop was split between two block files as “Ha” and “doop”, that instance wouldn’t be calculated).
Once the input splits are calculated, each node on the cluster is assigned an input split, and individually counts the instances of “Hadoop” within the input split that they’re assigned. The results of these individual map phases, represented as counts, are then shuffled, and aggregated during the reduce phase of the operation to return a full count of the instances Hadoop appears in this article.
IBM uses a great example of the census in Rome, where census takers would travel to each city in Rome, take a tally of the individuals located in that city, and return to the capital city, where the results were then aggregated to form a total tally of the population in Rome.
Apache Hadoop YARN — yet another resource negotiator
Prior to the launch of Hadoop 2.0 and the introduction of YARN, MapReduce also managed very basic workload management and job scheduling tasks, which made running non-MapReduce analytic applications on the same nodes very difficult. In fact, architectures that also included analytic applications had to be run outside of Hadoop, requiring the data from Hadoop to be moved to external nodes outside of the Hadoop infrastructure.
The introduction of YARN was significant because it finally allowed these non-MapReduce analytic applications to be run directly on the Hadoop cluster. YARN is able to do this by smartly balancing resources on individual nodes between MapReduce tasks and analytical applications.
Lots of really small files
One of the more famous constraints of Hadoop is that it has issues storing and processing really small files, which are files that are smaller than a HDFS block size (around 64MB). Hadoop is optimized to be able to process a few extremely large files very efficiently, but has a problem with smaller files from both a storage and a processing perspective.
Cloudera’s blog gives a great example that compares a single, 1GB file broken up into 16 64MB blocks vs. 10,000 files of 100KB each. The first instance will create 16 individual map tasks, the second example will create 10,000 map tasks, which will dramatically slow the performance even though the physical storage taken up by these files is the same.
Cloudera then gives two solutions to the small file problem in Hadoop, one of which involves creating a HAR file, and the other that involves creating a Sequence File.
Hadoop can either be deployed as an unstructured data lake and a refinery for data to be moved into a dedicated data warehouse, or as a dedicated data warehouse all on its own, leveraging in-cluster analytics.
Hadoop and a data warehouse
Many organizations choose to implement Hadoop as an unstructured data lake for developers, and use Hadoop’s processing framework as a low-cost ETL mechanism to pipe data into a dedicated data warehouse for analysts, business users, and analytical applications.
Hadoop as the data warehouse
Hadoop 2.0 and YARN, along with the steady growth of ANSI-compliant SQL on Hadoop engines allow in-cluster analytics to run directly on Hadoop. Under this implementation, organizations don’t need to move the data at all — instead compute and analytics are done on top of Hadoop.
The primary benefit of SQL on Hadoop engines is that they allow business users and analysts to run SQL against Hadoop data stores, completely eliminating the need for large scale data movement.