Not so fast, maybe relational databases aren’t dead!

Maybe the obituary announcing the demise of the relational database was premature!

Much has been written recently about the demise (or in some cases, the impending demise) of the relational database. “Relational databases are dead,” writes Savio Rodrigues on July 2nd. I guess I missed the announcement and the funeral in the flood of emails and twitters about another high-profile demise.

Some days ago, Michael Stonebraker authored an article with the title, “The End of a DBMS Era (Might be Upon Us)”. In September 2007 he made a similar argument in this article, and also in this 2005 paper with Uğur Çetintemel.

What Michael says here is absolutely true. And, in reality, Savio’s article just has a catchy title (and it worked). The body of the article makes a valid argument that there are some situations where the current “one size fits all” relational database offering that was born in the OLTP days may not be adequate for all data management problems.

So, let’s be perfectly clear about this: the issue isn’t that relational databases are dead. It is that a variety of use cases are pushing the current relational database offerings to their limits.

I must emphasize that I consider relational databases (RDBMS’s) to be those systems that use a relational model (a definition consistent with http://en.wikipedia.org/wiki/Relational_database). As a result, columnar (or vertical) representations, row (or horizontal) representations, systems with hardware acceleration (FPGA’s, …) are all relational databases. There is arguably some confusion in terminology in the rest of this post, especially where I quote others who tend to use the term “Relational Database” more narrowly, so as to create a perception of differentiation between their product (columnar, analytic, …) and the conventional row oriented database which they refer to as an RDBMS.

Tony Bain begins his three part series about the problem with relational databases with an introduction where he says

“The specialist solutions have be slowly cropping up over the last 5 years and now today it wouldn’t be that unusual for an organization to choose a specialist data analytics database platform (such as those offered from Netezza, Greenplum, Vertica, Aster Data or Kickfire) over a generic database platform offered by IBM, Microsoft, Oracle or Sun for housing data for high end analytics.”

While I have some issues with his characterization of “specialist analytic database platforms” as something other than a Relational Database, I assume that he is using the term RDBMS to refer to the commonly available (general purpose) databases that are most often seen in OLTP environments.

I believe that whether you refer to a column oriented architecture (with or without compression), an architecture that uses hardware acceleration (Kickfire, Netezza, …) or a materialized view, you are attempting to address the same underlying issue; I/O is costly and performance is significantly improved when you reduce the I/O cost. Columnar representations significantly reduce I/O cost by not performing DMA on unnecessary columns of data. FPGA’s in Netezza serve a similar purpose; (among other things) they perform projections thereby reducing the amount of data that is DMA’ed. A materialized view with only the required columns (narrow table, thin table) serves the same purpose. In a similar manner (but for different reasons), indexes improve performance by quickly identifying the tuples that need to be DMA’ed.
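To make the I/O argument concrete, here is a rough back-of-the-envelope sketch in Python. The schema, column widths and row count are made up for illustration, and the model deliberately ignores compression, page layout, caching and indexes; it only counts the bytes that a full scan would have to move for a query that touches a single column.

```python
# Hypothetical fixed-width schema: column name -> width in bytes.
# These names and sizes are assumptions for illustration only.
SCHEMA = {"order_id": 8, "customer_id": 8, "price": 8, "notes": 200}
NUM_ROWS = 1_000_000


def bytes_read_row_store(schema, num_rows):
    """A row-oriented scan moves whole rows, whatever the query needs."""
    row_width = sum(schema.values())
    return row_width * num_rows


def bytes_read_column_store(schema, num_rows, needed_columns):
    """A column-oriented scan moves only the columns the query references.

    A narrow materialized view holding just those columns behaves the
    same way in this simplified model.
    """
    return sum(schema[name] for name in needed_columns) * num_rows


if __name__ == "__main__":
    # Something like: SELECT SUM(price) FROM orders
    row_bytes = bytes_read_row_store(SCHEMA, NUM_ROWS)
    col_bytes = bytes_read_column_store(SCHEMA, NUM_ROWS, ["price"])
    print(f"row store scan:    {row_bytes:,} bytes")
    print(f"column store scan: {col_bytes:,} bytes")
    print(f"reduction factor:  {row_bytes / col_bytes:.0f}x")
```

The wider the table relative to the columns a query actually uses, the larger the gap, which is why projection (whether done by a columnar layout, an FPGA or a thin materialized view) pays off most on wide analytic tables.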

Notice that all of these solutions fundamentally address one aspect of the problem: how to reduce the cost of I/O. The challenges that are facing databases these days are somewhat different. In addition to the huge amounts of data that are being amassed (The Richard Winter article on the subject), there is a much broader variety of things that are being demanded of the repository of that information. For example, there is the “Search” model that has been discussed in a variety of contexts (web, peptide/nucleotide), and there are the stream processing and data warehousing cases, which have also received a fair amount of discussion.

Unlike the problem of I/O cost, many of these problems reflect issues with the fundamental structure and semantics of the SQL language. Some of these issues can be addressed with language extensions, User Defined Functions, MapReduce extensions and the like. But none of these address the underlying issue that the language and semantics were defined for a class of problems that we have today come to classify as the “OLTP use case”.

Relational databases are not dead; on the contrary, with the huge amounts of information that are being handled, they are more alive than ever before. The SQL language is not dead but it is in need of some improvements. That’s not something new; we’ve seen those in ’92, ’99, … But, more importantly, the reason why the Relational Database and SQL have survived this long is that they are widely used and portable. By being an extensible and descriptive language, SQL has managed to adapt to many of the new requirements that were placed on it.

And if the current problems are significant, two more problems are just around the corner, waiting to rear their ugly heads. The first is the widespread adoption of virtualization and the abstraction of computing resources. In addition to making it much harder to adopt solutions with custom hardware (that cannot be virtualized), it introduces a level of unpredictability in I/O bandwidth, latency and performance. Right along with this, users are going to want the database to live on the cloud. With that will come all the requirements of scalability, ease of use and deployment that one associates with a cloud based offering (not just the deployment model). The second is the fact that users will expect one “solution” to meet a wide variety of demands including the current OLTP and reporting through the real time alerting that today’s “Google/Facebook/Twitter Generation” has come to demand (look-ma-no-silos).

These problems are going to drive a round of innovation, and the NoSQL trend is a good and healthy trend. In the same description of all the NoSQL and analytics alternatives, one should also mention the various vendors who are working on CEP solutions. As a result of all of these efforts, Relational Databases as we know them today (general purpose OLTP optimized, small data volume systems) will evolve into systems capable of managing huge volumes of data in a distributed/cloud/virtualized environment and capable of meeting a broad variety of consumer demands.

The current architectures that we know of (shared disk, shared nothing, shared memory) will need to be reconsidered in a virtualized environment. The architectures of our current databases will also need some changes to address the wide variety of consumer demands. Current optimization techniques will need to be adapted and the underlying data representations will have to change. But, in the end, I believe that the thing that will decide the success or failure of a technology in this area is the extent of compatibility and integration with the existing SQL language. If the system has a whole new set of semantics and is fundamentally incompatible with SQL, I believe that adoption will be slow. A system that extends SQL and meets these new requirements will do much better.

Relational Databases aren’t dead; the model of “one-size-fits-all” is certainly on shaky ground! There is a convergence between the virtualization/cloud paradigms, the cost and convenience advantages of managing large infrastructures in that model and the business need for large databases.

Fasten your seat-belts because the ride will be rough. But, it is a great time to be in the big-data-management field!


6 thoughts on “Not so fast, maybe relational databases aren’t dead!”

  1. As with anything that gains momentum in IT – hyperbole is bound to exaggerate the whole “let’s kill the RDBMS” thing.

    However, I think it points out a more important point, and one that should be heeded: RDBMS software is heavily overused and abused in software today – they are used as storage engines for purposes they are entirely inappropriate for (ever seen a “properties”-table?).

    RDBMS have their place – as stores for _relational_ data, but not as catch all storage engines for what the filesystems can do.

    1. Willie,

      Thanks for your comment. You make a point that I have heard often but I don’t quite understand.

      I am waiting to confirm a quote from Dr. Codd where he says, and I paraphrase, at the end of the day, all data is relational.

      I don’t see what it is that a file system can do that the relational database model cannot. Maybe the limitations are not the limitations of the relational database model but rather of the DDL/DML (the languages) and the operators and tools that the relational database implementations provide.

      What would be awesome would be a discussion based on an example of something that is outside the realm of the relational representation.

      I believe that what we are seeing is a situation where there are solutions that are “easier” in a less rigid “filesystem” model; one where you get to control everything and don’t have to stipulate operations in a descriptive language but rather in a lower-level programming language (curly-bracket-language).

      I welcome your thoughts and opinion,

      -amrith

      1. Here’s a solid example:
        Serving up files/images in a web server.
        Serving them from a file system at the front, rather than from a database at the back will always be faster and more scalable.

        File and network I/O on a Linux machine has a theoretical limit of ca 65K connections, which is a lot higher than the number of queries that, for instance, MySQL can handle (not even mentioning the overhead of hopping through extra network layers for db retrieval).

        If you are not querying, aggregating and writing relational data in a transactional way, a RDBMS is overkill and often counterproductive.

      2. Regarding Codd, I think what he actually showed was the ability of the relational model to *represent* any data structure. However, this is mostly interesting at the theoretical level. In practice you need a way to represent your data that gives good performance and an API that fits your application/domain reasonably well. If you look into the following thread you’ll find some comments related to this theme: http://www.metafilter.com/82478/Neo4j-traverses-depths-of-1000-levels-and-beyond-at-millisecond-speed

        In that thread the practical example is a four levels deep friend-of-friend traversal. To begin with, it’s hard to solve this kind of problem using SQL because the language wasn’t created for this type of problem. And when you have solved that part, performance will still suck. Even an SQL dialect supporting recursive queries won’t help you here, as it just adds some syntactic sugar.

        I wrote up a small example of modeling a directed acyclic graph (DAG) here: http://wiki.neo4j.org/content/Roles
        DAGs are a terrible fit to the relational representation, but a good fit to a graph database.

  2. Willie,

    I understand your point. Thank you very much. I agree, this is a situation where you don’t need a relational database, or as you put it, one is counter-productive.

    Thank you,

    -amrith
