This is the second of a two-part blog post that presents a perspective on the recent trend to integrate MapReduce with Relational Databases especially Analytic Database Management Systems (ADBMS).
The first part of this blog post provides an introduction to MapReduce, provides a short description of the history and why MapReduce was created, and describes the stated benefits of MapReduce.
The second part of this blog post provides a short description of why I believe that integration of MapReduce with relational databases is a significant mistake. It concludes by providing some alternatives that would provide much better solutions to the problems that MapReduce is supposed to solve.
5 The case for combining MapReduce and Relational Databases
As described in the previous section, the claim has been that MapReduce allows users to perform operations that cannot be performed easily in SQL, and provides the user with some level of query Fault Tolerance.
5.1 Query Fault Tolerance
There is no doubt in my mind that MapReduce does in fact provide query Fault Tolerance and to the best of my knowledge, today’s MPP databases provide limited support for this.
So long as results from a long pipeline of operations are not stored into some transaction consistent persistent storage, it is not possible to “restart where you left off”.
At this writing, I do not believe that Aster Data provides true query Fault Tolerance and the only database implementation that claims anything like this is HadoopDB.
The database independent implementations of MapReduce (such as the one being used at Google) clearly provide a level of fault tolerance that is superior to that which they could obtain with a relational database.
5.2 The ability to perform operations not possible in SQL
This is most certainly the case if one makes the clarifying assumption that performing an operation in procedural SQL (stored procedure, row-by-row-processing by any name) does not scale and is not a viable solution.
Ironically, of the examples that I have seen being presented, the only one that actually seems to hold water is the example of sessionization. So, I make the following claims.
Claim: It is not possible to perform click stream sessionization in SQL.
Claim: It is not possible to perform operations that produce graphs of potentially unbounded length in SQL.
Consider the case of ordered clickstream data with a schema such as (USERID, TS, PAGEID) and the goal is to identify a session as a group of clicks by a user with an inter-click timeout less than 30 seconds.
This session could have arbitrary length; some sessions may have two clicks others may have two hundred. Some may be just one click long.
In order to generate a session using non-procedural SQL, one would have to perform a self-join of the clickstream table. For example, all sessions of length three could be determined using the following query:
select a.userid, a.ts, a.pageid, b.pageid, c.pageid from clickstream a, clickstream b, clickstream c where a.userid = b.userid and b.userid = c.userid and b.ts < a.ts + 30 and c.ts < b.ts + 30
This code is by no means perfect but it serves to illustrate the concept. Given that graphs of this kind may have arbitrary length, the SQL would not be possible to write without some procedural construct.
But, what this does also show is that fixed length graphs can be generated and analyzed with simple SQL.
Claim: It is possible to perform operations that produce graphs of deterministic length
It is for this reason that I believe that SESSIONIZE makes a strong case for MapReduce and the illustrations of nPath do not.
Finally, the Google examples such as distributed grep can be most definitely performed more efficiently within a database as scalar functions implemented in the UDF (User Defined Function) framework.
Databases like ORACLE even provide a REGEXP_LIKE operation that will perform just the kind of operation that is being implemented in MapReduce.
6 Why MapReduce is a bad solution
The success and adoption of SQL and Relational Databases owes a great deal to the fact that SQL is a high level language describing the data and operation and leaving the details to the implementation. This achieved standardization and some amount of portability.
Over many years, we have learnt that separation of schema and data is good. The ability to define database operations in terms of high level languages is good. MapReduce is a step in exactly the opposite direction. While I admit that it is a solution to a valid problem, I submit to you that it is the wrong solution. A good article describing why MapReduce is a major step backwards is an article by David DeWitt.
While David makes a good case, he provides no solutions beyond the promise that,
We fully understand that database systems are not without their problems. The database community recognizes that database systems are too “hard” to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems.
As enough has been written about why MapReduce is a bad solution, I’m going to move on to some recommendations for a better solution.
7 If MapReduce is a bad solution, what is a better solution?
The problems that MapReduce solve have been divided (in this blog post) into two categories; Query Fault Tolerance and the ability to perform operations not possible in SQL.
7.1 Query Fault Tolerance
Let’s face it, today there aren’t many good solutions to the query Fault Tolerance problem. Vendors who are using conventional RAID mechanisms can handle single drive failures and seamlessly recover without interrupting the query. Vendors who have specialized hardware and whose architectures resemble the expression “one disk per node” will find this harder to do. But, as databases get larger and larger, this problem is being addressed, and FAST! Node failure will still be an issue and vendors have to provide solutions to this. The solution should be one where the query continues to execute (note: start from the beginning under the covers is cheating and not acceptable) to completion.
7.2 The ability to perform operations not possible in SQL
Of the examples that I have seen people describe in support of MapReduce, the common theme is the construction of graphs. Most people I have spoken with realize that the examples like grep are purely for illustrative purposes only.
Jerome Pineau makes the following statement in his blog post on MapReduce
“In it, I learned that graph theory problems (think: travelling salesman) were well suited to MapReduce but not SQL. Examples included social networking (LinkedIn), Government (intelligence), Telecom (routing statistic), and retail (CRM, affinities), and finance (risk, fraud). Pretty much anything that can be modeled using interconnected nodes. But a connection from a node to another is really a “relation”, and so clearly well suited to a “relational engine”. So I might have missed something.”
Jerome is right, the connection from a node to another is a relation but SQL is customarily involved in “queries” with a deterministic number of such “relations”. Graphs could have non-deterministic numbers of such relations and that is one of the things that make them hard.
I believe that what SQL needs is not MapReduce but a standards based mechanism to handle connected graphs or arbitrary lengths.
Over the course of time, SQL2, SQL3 and more recently SQL 2003 have made extensions to the conventional relational algebra to force the notion of ordering and the relationship between tuples (PARTITIONS, OVER etc). The SESSIONIZING problem is a good example of one that should be addressed as part of the SQL native constructs that provide support for processing graphs of arbitrary lengths.
One possible place to in the syntax that already has the beginnings of constructs that could be extended to define a graph is the definition of a partition and a window (introduced in SQL2003). But, one thing that the SQL language still lacks is the ability to project (in a row), a value that is extracted from a preceding or following row.
7.3 A proposal for a COLUMNVAL() function
One useful extension to the SQL language that would allow one to handle situations such as the sessioninzing problem would be a function (that I am calling COLUMNVAL) which would work as follows.
COLUMNVAL ( COLUMN, OFFSET )
COLUMNVAL is a function that will return the value of the specified COLUMN (argument 1) in a row offset by OFFSET rows from the current row. The return type of the function COLUMNVAL is the same as the datatype of the COLUMN.Consider the following data in a table called clickstream > select * from clickstream; userid | timestamp | pageid --------+---------------------+-------- alpha | 2009-09-01 14:01:03 | page_1 alpha | 2009-09-01 14:01:08 | page_2 alpha | 2009-09-01 14:01:17 | page_1 alpha | 2009-09-01 14:01:49 | page_2 alpha | 2009-09-01 14:01:55 | page_3 bravo | 2009-09-01 14:01:07 | page_4 bravo | 2009-09-02 15:03:07 | page_4 charlie | 2009-09-01 14:01:03 | page_1 charlie | 2009-09-01 14:01:08 | page_2 charlie | 2009-09-01 14:01:17 | page_1 charlie | 2009-09-01 14:01:49 | page_2 charlie | 2009-09-01 14:01:55 | page_3 and the query select userid, timestamp, pageid, case when isnull ( columnval ( timestamp, -1 )) then userid || '_' || timestamp when columnval ( timestamp, -1 ) + 30 > timestamp them columnval ( sessionid, -1 ) else userid || '_' || timestamp end as sessionid from clickstream group by userid; that would produces the outputuserid | timestamp | pageid | sessionid --------+---------------------+--------+------------------------------ alpha | 2009-09-01 14:01:03 | page_1 | alpha_2009-09-01 14:01:03 alpha | 2009-09-01 14:01:08 | page_2 | alpha_2009-09-01 14:01:03 alpha | 2009-09-01 14:01:17 | page_1 | alpha_2009-09-01 14:01:03 alpha | 2009-09-01 14:01:49 | page_2 | alpha_2009-09-01 14:01:49 alpha | 2009-09-01 14:01:55 | page_3 | alpha_2009-09-01 14:01:49 bravo | 2009-09-01 14:01:07 | page_4 | bravo_2009-09-01 14:01:07 bravo | 2009-09-02 15:03:07 | page_4 | bravo_2009-09-02 15:03:07 charlie | 2009-09-01 14:01:03 | page_1 | charlie_2009-09-01 14:01:03 charlie | 2009-09-01 14:01:08 | page_2 | charlie_2009-09-01 14:01:03 charlie | 2009-09-01 14:01:17 | page_1 | charlie_2009-09-01 14:01:03 charlie | 2009-09-01 14:01:49 | page_2 | charlie_2009-09-01 14:01:49 charlie | 2009-09-01 14:01:55 | page_3 | charlie_2009-09-01 14:01:49 Similarly, the queryselect userid, timestamp, pageid, sessionid, case when isnull ( columnval ( sessionid, -1 )) then 'YES' when sessionid <> columnval ( sessionid, -1 ) then 'YES' else 'NO' end as yesno from ( select userid, timestamp, pageid, case when isnull ( columnval ( timestamp, -1 )) then userid || '_' || timestamp when columnval ( timestamp, -1 ) + 30 > timestamp them columnval ( sessionid, -1 ) else userid || '_' || timestamp end as sessionid from clickstream group by userid );
would returns the following resultsuserid | timestamp | pageid | sessionid | yesno --------+---------------------+--------+-----------------------------+------- alpha | 2009-09-01 14:01:03 | page_1 | alpha_2009-09-01 14:01:03 | YES alpha | 2009-09-01 14:01:08 | page_2 | alpha_2009-09-01 14:01:03 | NO alpha | 2009-09-01 14:01:17 | page_1 | alpha_2009-09-01 14:01:03 | NO alpha | 2009-09-01 14:01:49 | page_2 | alpha_2009-09-01 14:01:49 | YES alpha | 2009-09-01 14:01:55 | page_3 | alpha_2009-09-01 14:01:49 | NO bravo | 2009-09-01 14:01:07 | page_4 | bravo_2009-09-01 14:01:07 | YES bravo | 2009-09-02 15:03:07 | page_4 | bravo_2009-09-02 15:03:07 | YES charlie | 2009-09-01 14:01:03 | page_1 | charlie_2009-09-01 14:01:03 | YES charlie | 2009-09-01 14:01:08 | page_2 | charlie_2009-09-01 14:01:03 | NO charlie | 2009-09-01 14:01:17 | page_1 | charlie_2009-09-01 14:01:03 | NO charlie | 2009-09-01 14:01:49 | page_2 | charlie_2009-09-01 14:01:49 | YES charlie | 2009-09-01 14:01:55 | page_3 | charlie_2009-09-01 14:01:49 | NO
Is that not the sessionization that Aster Data is talking about? The queries above show that it is possible in a perfectly normal looking SQL extension, to provide the ability to support an arbitrary length clickstream without the use of MapReduce.
I will be the first to admit that there are situations that I still need to think about and handle, and circular references are a small problem. I am trying to figure out how to implement forward references in an efficient way that does not involve multiple passes and the code will soon look like the code that resolves formulae in a spreadsheet.
But, I much prefer this solution to the one provided by Aster Data and SQL/MR. At least it looks like SQL … (Yes, I don’t like the fact that I’m using GROUP BY to determine the PARTITION and there is no AGGREGATE function. That is something that must be addressed. This is a demonstration of a concept only).
MapReduce is not the answer because solutions are being created by database vendors in entirely non-standard and incompatible ways that will only reduce portability and increase vendor lock-in.
Code written for SQL/MR will only with Aster Data. Code written for use with Greenplum will only work with Greenplum. The same can be said of each and every vendor who is providing a MapReduce “Wart-On-The-Side” (WOTS) implementation.
Note: This is the first time I have seen the term WOTS or “Wart-On-The-Side” being used. So, feel free to use the term but remember where it came from 🙂
The solution is to extend the SQL language and provide a mechanism for it to describe the kinds of problems that we need to solve and leave the implementation of the language constructs to each vendor. Customers of databases would be well advised to reject these “vendor specific” WOTS implementations and instead demand a standards compliant extension that will ensure portability. I appreciate David DeWitt’s point of view that “The database community recognizes that database systems are too “hard” to use and is working to solve this problem”. I’m sure they will solve the problem; MapReduce is one such solution by the database community.
I believe that customers will be better served by a standards driven solution and not a “database community” driven solution.
 Currently I am hoping I can get away by saying that in the event of a circular reference, COLUMNVAL will return NULL. For some reason, I think that solution won’t pass muster.