This is the first of a two-part blog post that presents a perspective on the recent trend to integrate MapReduce with Relational Databases especially Analytic Database Management Systems (ADBMS).
The first part of this blog post provides an introduction to MapReduce, provides a short description of the history and why MapReduce was created, and describes the stated benefits of MapReduce.
The second part of this blog post provides a short description of why I believe that integration of MapReduce with relational databases is a significant mistake. It concludes by providing some alternatives that would provide much better solutions to the problems that MapReduce is supposed to solve.
1. What is MapReduce
MapReduce is a programming model and an associated implementation for manipulating very large data sets. It provides a framework in which a programmer writes two simple functions, a mapper and a reducer, that perform the actual data manipulation. Applications written using the MapReduce framework are automatically parallelized and suitable for execution on a large infrastructure of connected machines.
2. A brief history of MapReduce
Jeffrey Dean and Sanjay Ghemawat of Google announced MapReduce to the world in December 2004 at OSDI’04, the Sixth Symposium on Operating System Design and Implementation in San Francisco, CA. A copy of their original publication is available for download here. Some associated slideware is available here.
MapReduce had been in use at Google for some time before this publication. Today the MapReduce implementation has been ported to a variety of platforms and some relational database vendors have begun to offer MapReduce extensions to their databases. It is those extensions that are the subject of this article.
You can find a good high level overview of MapReduce on Wikipedia here.
3. A simple MapReduce example
The best way to understand MapReduce is to look at a simple example. The example provided below is one that can be executed easily in SQL and has been chosen because it provides a good description of MapReduce and of how it relates to conventional SQL.
A user has a data set consisting of three hundred million rows of data that contain census information. A row of data contains an individual's name, age, ethnicity, address (including zip code), sex, individual income, and so on.
The user would like to aggregate and sort this information to produce an output that lists the zip codes in the country and the associated average individual income.
Assuming a reasonably self-explanatory schema, one could express this in SQL as follows.
SELECT ZIPCODE, AVG(INCOME)
FROM CENSUSDATA
GROUP BY ZIPCODE;
The MapReduce solution to this problem would be something similar to the following pseudo-code:
MAP (KEY, VALUE):
// KEY: NAME OF THE INPUT FILE, CENSUSDATA
// VALUE: THE INPUT DATA, 300 MILLION ROWS WORTH
FOR EACH INDIVIDUAL IN VALUE
EMIT (ZIPCODE, INCOME);
REDUCE (KEY, ITERATOR):
// KEY: A ZIP CODE
// ITERATOR: A LIST OF INCOMES FOR THAT ZIP CODE
LOCAL COUNT = 0, TOTAL = 0;
FOR EACH INCOME IN ITERATOR
COUNT = COUNT + 1
TOTAL = TOTAL + INCOME;
EMIT (KEY, TOTAL/COUNT);
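The pseudo-code above can be sketched as a small runnable Python program. The row layout, field names and toy data below are illustrative assumptions, and the driver at the bottom only mimics the grouping step the framework would perform; it is a sketch, not the framework itself.

```python
from collections import defaultdict

# A toy, single-process sketch of the MAP and REDUCE functions above.
# The (name, age, zipcode, income) row layout is assumed for illustration.

def map_fn(key, value):
    # key: name of the input file (unused here)
    # value: an iterable of (name, age, zipcode, income) rows
    for name, age, zipcode, income in value:
        yield (zipcode, income)

def reduce_fn(key, iterator):
    # key: a zip code; iterator: all incomes emitted for that zip code
    count, total = 0, 0
    for income in iterator:
        count += 1
        total += income
    yield (key, total / count)

# Tiny driver that mimics the framework's shuffle/group step.
rows = [
    ("Ann", 34, "02134", 50000),
    ("Bob", 41, "02134", 70000),
    ("Cy",  29, "94040", 90000),
]
groups = defaultdict(list)
for zipcode, income in map_fn("censusdata", rows):
    groups[zipcode].append(income)
results = {k: v for g in groups for k, v in reduce_fn(g, groups[g])}
# results == {"02134": 60000.0, "94040": 90000.0}
```

Note that, as in the pseudo-code, the mapper is a pure projection and all of the aggregation lives in the reducer.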
To understand what that MapReduce pseudo-code is doing, consider how this gets executed.
The MapReduce framework is provided with a representation of this pseudo-code (in some language) and an input file with three hundred million rows of data. Also, for the purpose of this description, let us assume that there are one hundred (100) nodes that will participate in the execution of the MapReduce program.
The picture below shows this process.
Each node in this execution will have a running MapperJob. The framework (in some manner) will take the input file with three hundred million rows and carve it up into one hundred pieces.
In other words, each piece will contain about three million rows and each MapperJob will process three million rows of data.
The MapperJob will execute the pseudo-code function MAP and emit one row of data for each row of input data. In effect, the MAP operation is semantically equivalent to the database “projection”.
The projection would be a ZIPCODE and an INCOME. This would result in one hundred temporary files, one generated by each of the MapperJobs.
The framework now consolidates these one hundred files into a single file and then sorts it by ZIPCODE. The resulting file, still three hundred million rows long, would contain the (ZIPCODE, INCOME) pairs sorted by ZIPCODE.
Each node in this execution will have a running ReducerJob. The framework (in some manner) will take the above intermediate file and redistribute the data to the ReducerJobs. All data pertaining to a single ZIPCODE would be sent to the same ReducerJob.
One simple way to accomplish this would be to take the ZIPCODE and perform the following operation on it
locus = mod (zipcode, totalnodes)
where TOTALNODES would be one hundred (100) in the above illustration. Assuming that the nodes were numbered from zero (0) to ninety-nine (99), the value computed for LOCUS above would indicate which node should receive the data pertaining to the ZIPCODE.
This operation can be performed on non-numeric data as well through the use of some deterministic hash function. If the ReducerJob were aggregating on a non-numeric datum, one could compute LOCUS as
locus = mod (hash(datum), totalnodes)
Each ReducerJob would therefore receive a chunk of data relating to a specific ZIPCODE and could proceed to compute the total income and the number of individuals and therefore the average income.
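The routing rule above can be sketched in Python. The node count matches the one-hundred-node illustration; the hash function shown is an arbitrary stand-in for whatever deterministic hash the framework actually uses.

```python
# Sketch of the shuffle step: route each key emitted by the mappers to the
# reducer node responsible for that key.
TOTALNODES = 100  # matches the 100-node example above

def hash_fn(datum):
    # Any deterministic hash works; a simple polynomial hash is shown.
    # (Python's built-in hash() is salted per process for strings, so it
    # would not be suitable for routing across machines.)
    h = 0
    for ch in str(datum):
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def locus(key):
    # Numeric keys: mod(key, totalnodes); non-numeric keys: hash first.
    if isinstance(key, int):
        return key % TOTALNODES
    return hash_fn(key) % TOTALNODES

# All data pertaining to a single key lands on the same node:
assert locus("02134") == locus("02134")
assert locus(90210) == 90210 % TOTALNODES
```

Because the function is deterministic, every mapper routes a given ZIPCODE to the same ReducerJob without any coordination between the mappers.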
4. So, what’s new about this?
The preceding description of MapReduce, illustrating the steps involved in executing a specific data analysis operation, should lead you to ask, "What's the big deal?"
Analytics companies like Netezza and Teradata already had mechanisms to parallelize (MPP) the execution of queries such as the one above. So, what's new about MapReduce, and why did Google create this framework?
Since MPP databases that could parallelize these data manipulation operations already existed, it may seem surprising that Google built its own framework. But this wasn't a case of NIH (Not Invented Here).
4.1 Query Fault Tolerance
To meet the anticipated problems of scale, engineers at Google accurately determined that they needed parallel execution. If they had attempted to perform the kinds of operations that they wanted to perform on serial programming infrastructure, performance would not be satisfactory. This is largely because the speed of processors, disks and interconnects could not scale as fast as the size of input data and computation complexity.
Further, as part of the need to parallelize their infrastructure, Google chose to use inexpensive commodity (COTS) hardware with inexpensive disk drives. One should therefore also be cognizant of the failure trends in these kinds of systems; one telling paper (from Google) can be found here. Another good description of MapReduce, including a description of failure rates, can be found here.
Basically, while disk capacities were growing rapidly (because of increases in data recording density), failure rates (measured in failures per GB per year) were remaining constant.
As a result, at “Google Scale”, it was unlikely that any reasonably sized database query would complete without encountering a disk failure. Conventional disk failure strategies like RAID5 are known to be failure prone, for two reasons. The first is inherent in the failure rates themselves: it is highly likely that an unrecoverable read error will occur during RAID5 raid-stripe reconstruction (see description here). The second has to do with the nature of RAID5 parity validation: during parity verification, the controller reads data and parity and checks that they match; if they do not, many controllers will attempt to “fix” the problem, and this fixing may introduce silent data corruption.
For these reasons, engineers at Google took great pains to build a fault-tolerance mechanism into the MapReduce framework. If a node failed for some reason, its work would be restarted on another node and the MapReduce program would complete successfully.
At the time when Google developed this framework, other relational database systems that could scale to these data volumes did not provide robust “Fault Tolerance” and Google’s engineers correctly determined that query failures would be frequent.
You can read all about how MapReduce achieves this in section 3.3 of the MapReduce paper.
4.2 The ability to perform operations not possible in SQL
The claim has been made by many that MapReduce can be used to perform operations that cannot be performed in conventional SQL. Three specific examples of this kind of operation are described below.
4.2.1 SESSIONIZATION in SQL/MR and SQL
The Aster Data presentations of SQL/MR make the claim that SQL/MR can be used to process queries that are not even expressible in SQL. You can listen to the context around that statement and the associated presentation at the Aster Data web page here. The comment that I highlight above is approximately four minutes into the presentation (at about 03:44 to 04:15 of the 11:42 presentation).
You should listen to the entire context because, in fairness to Aster Data, they also make the case that SQL/MR makes queries easier to write and understand, and there is definitely merit to that claim.
Around four and a half minutes into the presentation (Aster Data presentation on SQL/MR, 04:30 of the 11:42 presentation), a description is provided of the sessionizing problem in clickstream analysis.
The SQL/MR representation of this solution is (abbreviated here; the source-table and timeout arguments to sessionize are not shown in the presentation excerpt):
select userid, ts, pageid, sessionid
from sessionize (
partition by userid
order by ts
);
And the claim is that this would be long and complicated in pure SQL.
For those of you who don't know what sessionization is, you should listen to the description of the problem in the presentation. In a nutshell, a system collects clickstream information including a userid, a timestamp, and a pageid (URL). A session is considered to be a sequence of clicks separated by no more than a certain specified timeout (here, 60 seconds). This process emits a series of “sessionized” clickstreams.
The claim is made that this is a problem whose solution would be “long and complicated” in SQL and I dispute that claim. (If you don’t agree with me, then try and write this in SQL; I will be talking more about this claim in the second part of this blog post).
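To make concrete just how little logic is involved, here is a minimal sessionizing sketch in Python (not SQL; the row layout and toy data are illustrative assumptions, and the 60-second timeout follows the presentation):

```python
# Assign a session id to each click: a new session starts whenever the gap
# since the user's previous click exceeds the timeout (60 seconds here).
def sessionize(clicks, timeout=60):
    # clicks: list of (userid, ts, pageid) tuples, ts in seconds
    clicks = sorted(clicks, key=lambda c: (c[0], c[1]))
    out, last_user, last_ts, session = [], None, None, 0
    for userid, ts, pageid in clicks:
        if userid != last_user:
            session = 0            # first click for this user
        elif ts - last_ts > timeout:
            session += 1           # gap too large: new session
        out.append((userid, ts, pageid, session))
        last_user, last_ts = userid, ts
    return out

clicks = [("u1", 0, "/a"), ("u1", 30, "/b"), ("u1", 200, "/c"), ("u2", 5, "/a")]
sessionize(clicks)
# [('u1', 0, '/a', 0), ('u1', 30, '/b', 0), ('u1', 200, '/c', 1), ('u2', 5, '/a', 0)]
```

The whole operation is a single ordered scan per user, which is exactly the shape of computation that window constructs over ordered data are designed for.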
4.2.2 SQL/MR and nPath
A second presentation by Aster Data describes the nPath built-in MapReduce construct. You can listen to that presentation here. SQL/MR's nPath is a basic construct used to match ordered sequences of rows (paths) in relational data. The example presented, also with clickstream data, is to find a sequence of rows where a user performed four specified actions:
- Visited a designated page (home page)
- Viewed an auction
- Viewed the profile of the seller and finally
- Placed a bid on the auction
Here again, the claim is made that SQL/MR’s nPath allows a user to “transcend SQL’s limitations with respect to ordered data”. That claim is made about 40 seconds into the 12:48 presentation. Oddly enough, the examples presented do not really appear to be things that cannot be done in SQL.
This kind of graph construction is something that is absolutely doable in SQL, and the SQL isn't even that complicated.
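As a sketch of the underlying matching logic (in Python; the action names are illustrative, and this matches the four actions as an ordered subsequence of a user's clicks, not necessarily adjacent ones):

```python
# Find whether a user's ordered clickstream contains the four actions in
# sequence. Action names are illustrative placeholders.
PATTERN = ["home", "view_auction", "view_seller", "bid"]

def matches_path(actions, pattern=PATTERN):
    # Greedy subsequence match over the user's clicks in time order.
    i = 0
    for a in actions:
        if i < len(pattern) and a == pattern[i]:
            i += 1
    return i == len(pattern)

stream = ["home", "search", "view_auction", "view_seller", "home", "bid"]
matches_path(stream)  # True
```

A single ordered pass per user suffices, which is why such queries are also expressible with ordinary self-joins or window functions over the timestamp column.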
4.2.3 Google’s example of Distributed Grep
Google provides the example of a MapReduce implementation of grep in the original MapReduce paper. The description below is from section 2.3 where you will also find examples such as “Count of URL Access Frequency” and “Reverse Web Link Graph”.
“Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.”
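In the same toy Python style as the earlier example, that description can be sketched as follows (the pattern, file name, and input lines are illustrative):

```python
import re

# Distributed grep, per the paper's description: the map function emits a
# line if it matches the supplied pattern; the reduce function is an
# identity that copies the intermediate data to the output.
def map_fn(key, value, pattern):
    # key: name of the input file (unused); value: an iterable of lines
    rx = re.compile(pattern)
    for line in value:
        if rx.search(line):
            yield (key, line)

def reduce_fn(key, iterator):
    # Identity: pass the matched lines through unchanged.
    for line in iterator:
        yield line

lines = ["error: disk failed", "ok", "error: timeout"]
matched = [l for _, l in map_fn("log.txt", lines, r"error")]
# matched == ["error: disk failed", "error: timeout"]
```

All of the real work happens in the mappers; the framework's role is simply to run them in parallel over the pieces of a very large input.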
This concludes the first part of this two part blog post.