Is this the end of NoSQL?

If it is, you read it here first!

I posted this article on my other (work-related) blog.

I think the future for NoSQL isn’t as bright as a lot of pundits would have you believe. Yes, yes, I know that MongoDB got a $1.2 billion valuation. Here are some other things to keep in mind:
  1. In the heyday of OODBMS, XML DB, and OLAP/MDX, there was similar hype about those technologies.
  2. Today, more and more NoSQL vendors are trying to build “SQL’isms” into their products. I often hear of people who want a product that has the scalability of NoSQL with transactions and a standard query language. Yes, we have that; it is called a horizontally scalable RDBMS!

Technologies come and technologies go but the underlying trends are worth understanding.

And the trends don’t favor NoSQL.

Why MongoDB and NoSQL make me want to scream

Recently I saw an article on the MongoHQ blog where they described a “slow query” and how to improve the performance.

The problem is this. I have a document with the following fields:

  • ID
  • submitdate
  • status
  • content

The status can be something like ‘published’, ‘rejected’, ‘in progress’, ‘draft’, etc.

I want to find all articles with some set of statuses, sorted by submit date.

Apparently the MongoDB solution to this problem (according to their own blog) is to:

  1. create a new field called ‘unpublished_submit_date’
  2. set that field to a ‘null’ value if the document is of an uninteresting status (e.g. published)
  3. set that field to the submitdate if it is an interesting status (e.g. not published)
  4. then query on the single column unpublished_submit_date

Really? Really? You’ve got to be kidding me.
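For contrast, here is a minimal sketch of how the same request looks in a relational database, using Python's built-in sqlite3 module. The table name articles, the index name idx_status_date and the sample rows are assumptions made up for illustration; only the four fields come from the problem description above. Whether a particular engine uses the compound index for both the filter and the sort depends on its query planner.

```python
import sqlite3

# In-memory database with the four fields from the problem statement.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    " id INTEGER PRIMARY KEY,"
    " submitdate TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " content TEXT)"
)
# Hypothetical compound index covering the filter column and the sort column.
conn.execute("CREATE INDEX idx_status_date ON articles (status, submitdate)")

# Made-up sample rows, purely for illustration.
sample_rows = [
    (1, "2013-01-05", "published", "a"),
    (2, "2013-01-03", "draft", "b"),
    (3, "2013-01-04", "rejected", "c"),
    (4, "2013-01-01", "in progress", "d"),
    (5, "2013-01-02", "published", "e"),
]
conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?)", sample_rows)


def articles_with_status(conn, statuses):
    """Return (id, submitdate, status) for the given statuses, oldest first."""
    placeholders = ", ".join("?" for _ in statuses)
    sql = (
        "SELECT id, submitdate, status FROM articles "
        f"WHERE status IN ({placeholders}) "
        "ORDER BY submitdate"
    )
    return conn.execute(sql, list(statuses)).fetchall()


unpublished = articles_with_status(conn, ["draft", "rejected", "in progress"])
for row in unpublished:
    print(row)
```

The query states what is wanted (a set of statuses, ordered by date) and leaves the access path to the database; no extra field has to be maintained on every write.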

For more on this interesting exchange, a response from a MongoDB fanboy, and a follow-up, read my work blog at

http://parelastic.com/blog/more-subject-improving-performance-removing-query-logic

The things people have to do to use NoSQL boggle the mind!

Database scalability myth (again)

A common myth that has been perpetuated is that relational databases do not scale beyond two or three nodes. That, and the CAP Theorem, are considered to be the reasons why relational databases are unscalable and why NoSQL is the only feasible solution!

Yesterday I ran into a very thought-provoking article that makes just this case. You can read that entire post here. In this post, the author Srinath Perera provides an interesting template for choosing the data store for an application. In it, he makes the case that relational databases do not scale beyond 2 to 5 nodes. He writes,

The low scalability class roughly denotes the limits of RDBMS where they can be scaled by adding few replicas. However, data synchronization is expensive and usually RDBMSs do not scale for more than 2-5 nodes. The “Scalable” class roughly denotes data sharded (partitioned) across many nodes, and high scalability means ultra scalable systems like Google.

In 2002, when I started at Netezza, the first system I worked on (affectionately called Monolith) had almost 100 nodes. The first production class “Mercury” system had 108 nodes (112 nodes, 4 spares). By 2006, the systems had over 650 nodes and more recently much larger systems have been put into production. Yet, people still believe that relational databases don’t scale beyond two or three nodes!

Systems like ParElastic (Elastic Transparent Sharding) can certainly scale to much more than two or three nodes, and I’ve run prototype systems with up to 100 nodes on Amazon EC2!
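To make the idea of sharding (partitioning) concrete, here is a minimal sketch of hash-based placement of rows across nodes. This is a generic textbook technique and is not a description of how Netezza or ParElastic actually distribute data; the function names shard_for and distribute, and the choice of 100 nodes, are assumptions for illustration only.

```python
import hashlib


def shard_for(key, num_nodes):
    """Map a key to a node number using a stable hash of the key."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribute(keys, num_nodes):
    """Group keys by the node that would own them."""
    nodes = {n: [] for n in range(num_nodes)}
    for key in keys:
        nodes[shard_for(key, num_nodes)].append(key)
    return nodes


placement = distribute(range(10_000), 100)
sizes = [len(keys) for keys in placement.values()]
print("nodes:", len(placement))
print("smallest shard:", min(sizes), "largest shard:", max(sizes))
```

Because every key has exactly one owner, each node works on its own slice of the data, which is why this style of system is not limited by the cost of keeping full replicas in sync.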

Srinath’s post does contain an interesting perspective on unstructured and semi-structured data though, one that I think most will generally agree with.

Look Ma! NoSQL!

More musings on NoSQL and a blog I read “NoSQL: If Only It Was That Easy”

There has definitely been more chatter about NoSQL in the Boston area lately. I hear there is a group forming around NoSQL (I will post more details when I get them). There were some NoSQL folks at the recent Cloud Camp, which I was not able to attend (damn!).

My views on NoSQL are unchanged from an earlier post on the subject. I think there are some genuine issues about database scaling that are being addressed through a variety of avenues (packages, tools, …). But, in the end, the reason that SQL has survived for so long is because it is a descriptive language that is reasonably portable. That is also the reason why, in the data warehousing space, you have each vendor going off and doing some non-SQL extension in a totally non-portable way. And they are all going to, IMHO, have to reconcile their differences before the features get wide mainstream adoption.

This morning I read a well-researched blog post by BJ Clark by way of Hacker News. (If you don’t use HN, you should definitely give it a try.)

I strongly recommend that if you are interested in NoSQL, you read the conclusion section carefully. I have annotated the conclusion section below.

“NoSQL is a great tool, but it’s certainly not going to be your competitive edge, it’s not going to make your app hot, and most of all, your users won’t give a shit about any of this.

What am I going to build my next app on? Probably Postgres.

Will I use NoSQL? Maybe. [I would not, but that may just be my bias]

I might keep everything in flat files. [Yikes! If I had to do this, I’d consider something like MySQL CSV first]


If I need reporting, I won’t be using any NoSQL.

If I need ACIDity, I won’t use NoSQL.

If I need transactions, I’ll use Postgres.

…”

NoSQL is a great stepping stone; what comes next will be really exciting technology. If what we need is a database that scales, let’s go make ourselves a database that scales. Base it on MySQL, PostgreSQL, … but please make it SQL based. Extend SQL if you have to. I really do like to be able to coexist with the rich ecosystem of visualization tools, reporting tools, dashboards, … you get the picture.


Not so fast, maybe relational databases aren’t dead!

Maybe the obituary announcing the demise of the relational database was premature!

Much has been written recently about the demise (or in some cases, the impending demise) of the relational database. “Relational databases are dead,” writes Savio Rodrigues on July 2nd. I guess I missed the announcement and the funeral in the flood of emails and tweets about another high profile demise.

Some days ago, Michael Stonebraker authored an article with the title, “The End of a DBMS Era (Might be Upon Us)”. In September 2007 he made a similar argument in this article, and also in this 2005 paper with Uğur Çetintemel.

What Michael says here is absolutely true. And, in reality, Savio’s article just has a catchy title (and it worked). The body of the article makes a valid argument that there are some situations where the current “one size fits all” relational database offering that was born in the OLTP days may not be adequate for all data management problems.

So, let’s be perfectly clear about this; the issue isn’t that relational databases are dead. It is that a variety of use cases are pushing the current relational database offerings to their limits.

I must emphasize that I consider relational databases (RDBMS’s) to be those systems that use a relational model (a definition consistent with http://en.wikipedia.org/wiki/Relational_database). As a result, columnar (or vertical) representations, row (or horizontal) representations, systems with hardware acceleration (FPGA’s, …) are all relational databases. There is arguably some confusion in terminology in the rest of this post, especially where I quote others who tend to use the term “Relational Database” more narrowly, so as to create a perception of differentiation between their product (columnar, analytic, …) and the conventional row oriented database which they refer to as an RDBMS.

Tony Bain begins his three-part series about the problem with relational databases with an introduction where he says:

“The specialist solutions have be slowly cropping up over the last 5 years and now today it wouldn’t be that unusual for an organization to choose a specialist data analytics database platform (such as those offered from Netezza, Greenplum, Vertica, Aster Data or Kickfire) over a generic database platform offered by IBM, Microsoft, Oracle or Sun for housing data for high end analytics.”

While I have some issues with his characterization of “specialist analytic database platforms” as something other than a Relational Database, I assume that he is using the term RDBMS to refer to the commonly available (general purpose) databases that are most often seen in OLTP environments.

I believe that whether you refer to a column oriented architecture (with or without compression), an architecture that uses hardware acceleration (Kickfire, Netezza, …) or a materialized view, you are attempting to address the same underlying issue; I/O is costly and performance is significantly improved when you reduce the I/O cost. Columnar representations significantly reduce I/O cost by not performing DMA on unnecessary columns of data. FPGA’s in Netezza serve a similar purpose; (among other things) they perform projections thereby reducing the amount of data that is DMA’ed. A materialized view with only the required columns (narrow table, thin table) serves the same purpose. In a similar manner (but for different reasons), indexes improve performance by quickly identifying the tuples that need to be DMA’ed.
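A back-of-the-envelope sketch of that I/O argument: the column names and byte widths below are hypothetical, and the model ignores compression, page layout and caching. It only counts the bytes a scan has to move when a query needs two columns out of four.

```python
# Hypothetical fixed column widths in bytes; real systems vary widely.
COLUMN_WIDTHS = {"id": 8, "submitdate": 8, "status": 16, "content": 4000}


def row_store_bytes(num_rows, wanted_columns):
    """A row layout reads every column of every row it scans."""
    return num_rows * sum(COLUMN_WIDTHS.values())


def column_store_bytes(num_rows, wanted_columns):
    """A column layout reads only the columns the query references."""
    return num_rows * sum(COLUMN_WIDTHS[c] for c in wanted_columns)


rows = 1_000_000
wanted = ["submitdate", "status"]
row_bytes = row_store_bytes(rows, wanted)
col_bytes = column_store_bytes(rows, wanted)
print(f"row layout:    {row_bytes:,} bytes")
print(f"column layout: {col_bytes:,} bytes")
print(f"reduction:     {row_bytes // col_bytes}x")
```

A narrow materialized view containing only submitdate and status comes out the same way in this model, which is the sense in which these techniques attack the same cost.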

Notice that all of these solutions fundamentally address one aspect of the problem: how to reduce the cost of I/O. The challenges that are facing databases these days are somewhat different. In addition to the huge amounts of data that are being amassed (see the Richard Winter article on the subject), there is a much broader variety of things being demanded of the repository of that information. For example, there is the “Search” model that has been discussed in a variety of contexts (web, peptide/nucleotide), and there are the stream processing and data warehousing cases that have also received a fair amount of discussion.

Unlike the problem of I/O cost, many of these problems reflect issues with the fundamental structure and semantics of the SQL language. Some of these issues can be addressed with language extensions, User Defined Functions, MapReduce extensions and the like. But none of these address the underlying issue that the language and semantics were defined for a class of problems that we have today come to classify as the “OLTP use case”.

Relational databases are not dead; on the contrary, with the huge amounts of information that are being handled, they are more alive than ever before. The SQL language is not dead but it is in need of some improvements. That’s not something new; we’ve seen those in ’92, ’99, … But, more importantly, the reason why the Relational Database and SQL have survived this long is that SQL is widely used and portable. By being an extensible and descriptive language, it has managed to adapt to many of the new requirements that were placed on it.

And if the current problems are significant, two more problems are just around the corner and waiting to rear their ugly heads. The first is the widespread adoption of virtualization and the abstraction of computing resources. In addition to making it much harder to adopt solutions with custom hardware (that cannot be virtualized), it introduces a level of unpredictability in I/O bandwidth, latency and performance. Right along with this, users are going to want the database to live on the cloud. With that will come all the requirements of scalability, ease of use and deployment that one associates with a cloud based offering (not just the deployment model). The second is the fact that users will expect one “solution” to meet a wide variety of demands, from the current OLTP and reporting through to the real time alerting that today’s “Google/Facebook/Twitter Generation” has come to demand (look-ma-no-silos).

These problems are going to drive a round of innovation, and the NoSQL trend is a good and healthy trend. In the same description of all the NoSQL and analytics alternatives, one should also mention the various vendors who are working on CEP solutions. As a result of all of these efforts, Relational Databases as we know them today (general purpose OLTP optimized, small data volume systems) will evolve into systems capable of managing huge volumes of data in a distributed/cloud/virtualized environment and capable of meeting a broad variety of consumer demands.

The current architectures that we know of (shared disk, shared nothing, shared memory) will need to be reconsidered in a virtualized environment. The architectures of our current databases will also need some changes to address the wide variety of consumer demands. Current optimization techniques will need to be adapted and the underlying data representations will have to change. But, in the end, I believe that the thing that will decide the success or failure of a technology in this area is the extent of compatibility and integration with the existing SQL language. If the system has a whole new set of semantics and is fundamentally incompatible with SQL, I believe that adoption will be slow. A system that extends SQL and meets these new requirements will do much better.

Relational Databases aren’t dead; the model of “one-size-fits-all” is certainly on shaky ground! There is a convergence between the virtualization/cloud paradigms, the cost and convenience advantages of managing large infrastructures in that model and the business need for large databases.

Fasten your seat-belts because the ride will be rough. But, it is a great time to be in the big-data-management field!