We recently had the opportunity to present a talk on the MapReduce programming paradigm for the Raleigh-Durham chapter of Girl Develop It at Syncfusion’s office in Morrisville, NC.
The primary goal was to walk through a tutorial implementation of MapReduce to inspire participants in their own efforts.
This blog presents the main segments of the talk. If you would like to watch a video of the entire session, you can find the recording on our YouTube channel.
The need for a cluster to store and process data
Data is becoming more plentiful and, in order to process and store such data in a scalable manner, we need to leverage the resources of a cluster of machines instead of relying on a single machine, however powerful it may be.
Working with a cluster of machines does pose its own set of problems. The MapReduce programming paradigm was used by Google to successfully process data across a cluster of machines over a decade ago. They documented their experience in several papers, most notably:
- “The Google File System” (Ghemawat, Gobioff, and Leung, 2003)
- “MapReduce: Simplified Data Processing on Large Clusters” (Dean and Ghemawat, 2004)
These papers were later used as a guide to implement MapReduce in Apache Hadoop.
Our goal is to take a simple scenario and, by considering how to implement a processing solution in a scalable manner, arrive at the MapReduce programming model. The hope is that such a walk-through will lead to a deeper understanding of the internals of MapReduce.
Scala syntax introduction
The sample code provided is written in Scala. The essential syntax can be found on GitHub: github.com/danjebaraj/hadoopmr/blob/master/scalaquick/src/quickstart.sc.
The Scala worksheet linked above contains all the Scala syntax we would need to get started. Note that the objective of the talk and associated code is not really to write idiomatic Scala. In fact, we avoid using Map and Reduce and instead stick to using loops to build toward the MapReduce programming model.
Running the sample Scala worksheet requires you to install the Scala IDE from scala-ide.org. The Scala IDE is also useful when working with the other samples, but other IDEs, such as IntelliJ IDEA, will also work well.
The challenge
We have sample data in the following table.
| Person’s name | Favorite beverage | Cups consumed per day |
|---------------|-------------------|-----------------------|
| Brianna | coffee | 3 |
| Cameron | milk | 5 |
| Thomas | milk | 4 |
| Wyatt | coffee | 5 |
The goal is to take the data in this table and compute the average number of cups of each beverage consumed per day across the data set.
This is a simple problem along the same lines as the commonly used WordCount example. We chose this data because it maps readily to gathering and computing the same information from people in the audience.
The implementation
Our first attempt at computing summaries is a simple, single-threaded solution, sketched in code below the list. These are the steps we follow:
- We loop through and transform the data, removing fields we don’t need. In this instance, we drop each person’s name, since we are only interested in beverages and cups consumed.
- The resulting data set is a list of tuples with each tuple containing a beverage name and cups consumed.
- We group the data by beverage.
- The grouped data is then summarized per group to produce results.
The complete code for these steps is hosted on GitHub: github.com/danjebaraj/hadoopmr/blob/master/cupcount/src/main/scala/Driver.scala.
Can we make the code parallel?
In order to make the code parallel, we must consider the issue of shared state. In general, shared state is trouble and should be avoided where possible. It cannot, of course, be totally avoided, but if we are deliberate in the way we design our program, we can push most of the complexity down into libraries that do the heavy lifting. The key idea is to make the program concurrent by design. For more details on concurrency, refer to this article on The Go Blog.
To illustrate issues with shared state, we have a simple sample available at github.com/danjebaraj/hadoopmr/blob/master/sharedstate/src/main/scala/Driver.scala. You can safely skip this sample if you are already familiar with the issues surrounding shared state.
We use the Akka library to break up the program into concurrent units. Akka works through a message processing paradigm. The core idea is that a program can be broken into independent units of work that communicate by sending each other messages with no state shared between them. In Akka terms, these concurrent units are referred to as actors. Actors can communicate through messages but do not typically share any state.
Akka makes it easy to create a pool of actors to service requests by using more than one instance of an actor. We use the RoundRobinPool for this purpose.
We explain Akka fundamentals using a sample that breaks a large list of numbers into several partitions, which are then summed by different units of work. The partial results are combined into a final sum that is presented to the user.
Code for this section is available on GitHub: github.com/danjebaraj/hadoopmr/blob/master/akkaquick/src/main/scala/Driver.scala.
A parallel implementation of cup count using Akka
We can easily make the first operation, the transformation of data by dropping names, parallel. We essentially break the data into blocks and send each block to be worked on independently by Akka actors. Each Akka actor, named FileProcessor in this instance, will process the blocks that it is sent and then report back to the sender with transformed information. Since the data is stored across files, we simply treat each file as a block for convenience.
The processed blocks are then received by an actor named Aggregator. The Aggregator groups and summarizes the data just as the single-threaded program did:
github.com/danjebaraj/hadoopmr/blob/master/distcupcount1/src/main/scala/Driver.scala.
Can we do better?
You will notice that the grouping and summarization operations happen within a single thread in our first attempt at a parallel model. We can easily make the summary calculation itself parallel since aggregating all the data for, say, tea does not have anything to do with the data for coffee. This is what we do in our second take. We group the data as before and then pass it on to a group of actors for the calculation of summaries. Here, the summarizer actor simply computes summaries for each group and reports them back:
github.com/danjebaraj/hadoopmr/blob/master/distcupcount2/src/main/scala/Driver.scala.
The final iteration: MapReduce
At this point, we have, in essence, a working implementation of MapReduce. The table below maps each part of our sample to its MapReduce counterpart.
| Our sample | MapReduce |
|------------|-----------|
| The Map operation transforms the data from the available form into the desired form. In this instance, we perform a one-to-one mapping, creating exactly one output record per input record. This, however, is not a requirement: we can just as easily generate more than one record per input record. The only requirement is that there is a key and a value per record. | Map operation (user-provided code). The MapReduce framework handles the partitioning of data into blocks. |
| The code then groups by key. This internally involves sorting. The grouped data is then provided to different aggregators, or reducers as they are known in MapReduce. | Shuffle: developer.yahoo.com/hadoop/tutorial/module4.html#dataflow |
| The reducer code receives all the data per key (group) and aggregates it into the desired value. | Reduce (user-provided code). |
The final code can be found on GitHub: github.com/danjebaraj/hadoopmr/blob/master/distcupcount3/src/main/scala/Driver.scala.
Other important elements
Minimizing network traffic
Apache Hadoop’s implementation of storage (the Hadoop Distributed File System, HDFS) is designed to minimize network traffic. Chunks of data are replicated on multiple machines. When a MapReduce program is initialized for execution, the code that performs the Map operation is, to the extent possible, run on the machines where the data is stored. When running locally is not possible, data locality is still considered when scheduling the Map operation: it will be scheduled on a machine connected to the same rack as a machine that has a copy of the data before a machine on another rack is considered. This allows the system to minimize the movement of data across the network, thereby sidestepping a major bottleneck.
Handling node failure
In the real world, especially on a large cluster, machines fail quite often. The Hadoop implementation of MapReduce does a lot of work beneath the surface to recover from failure by rescheduling tasks on other nodes where the same data is available. Additionally, if a node takes a long time to complete a task (such slow nodes are known as stragglers), the framework can speculatively schedule a duplicate execution on another node and cancel the original task if the duplicate completes first.
Complete sample code and slides
Sample code: The complete set of samples is available at github.com/danjebaraj/hadoopmr. You need to install a recent JDK distribution and SBT to get started.
Slides: The slides used with the presentation can be viewed at www.slideshare.net/DanielJebaraj/mapreduce-succinctly. The talk itself was originally titled “Hadoop Succinctly,” but it focused more on MapReduce than Apache Hadoop.
Run the samples and take a look at the code to build a solid working understanding of how MapReduce works under the hood.
Interested in Apache Hadoop?
If you are interested in working with Apache Hadoop on Windows, be sure to check out the Syncfusion Big Data Platform designed for Windows at www.syncfusion.com/products/big-data.