DynamoDB Condition Checks

DynamoDB Condition Checks are a powerful (and somewhat misunderstood) capability that makes application development much easier.

Consider this most basic “financial” use-case. I have a table that contains customer account balances. Alice has $500 and Bob has $200. Bob wishes to give Alice $100. This is something really simple that screams “RDBMS” but is actually non-trivial with an RDBMS.

We all know that we need transactions because …

But consider this simple twist – How do you prevent Bob’s account balance from going negative?

There is (and I’d love to be proved wrong) no simple SQL solution to this in an RDBMS without using one of (a) constraints, (b) triggers, or (c) stored procedures to implement the operation. The easiest is to stick a >= 0 CHECK constraint on the account balance. If Bob tries to give Alice more than $200, the update will fail with a constraint violation.
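To see what the constraint approach buys you, here is a minimal sketch in Python. I’m using SQLite purely because it is self-contained; the table layout mirrors the example above, and the CHECK constraint is what refuses the overdraft and rolls the whole transfer back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balances ("
    "  owner   TEXT PRIMARY KEY,"
    "  balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO balances VALUES ('alice', 500), ('bob', 200)")
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("UPDATE balances SET balance = balance + 300 WHERE owner = 'alice'")
        conn.execute("UPDATE balances SET balance = balance - 300 WHERE owner = 'bob'")
except sqlite3.IntegrityError as err:
    # Bob only has $200, so the CHECK constraint refuses the debit and the
    # credit to Alice is rolled back along with it.
    print("transfer refused:", err)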

Why do you need one of these things? Consider this (flawed) implementation with an RDBMS.

--
-- WARNING: This implementation is flawed
--

BEGIN;

UPDATE BALANCES
SET BALANCE = BALANCE + 300
WHERE OWNER = 'ALICE';

-- The update below may, or may not, update Bob's balance!

UPDATE BALANCES
SET BALANCE = BALANCE - 300
WHERE OWNER = 'BOB' AND BALANCE > 300;

COMMIT;

This will complete successfully: the first update gives Alice the money, and the second “succeeds” without doing anything because no row matches its WHERE clause!

This is where the Condition Check in DynamoDB comes in. Consider this with DynamoDB:

us-east-1> create table balances ( owner string ) primary key ( owner hash ) billing mode on demand;
CREATE
us-east-1> insert into balances ( owner, balance ) values ( "alice", 500 ), ( "bob", 200 );
INSERT
INSERT
us-east-1> select * from balances;
{balance: 200, owner: bob}
{balance: 500, owner: alice}

us-east-1> begin;
us-east-1> update balances set balance = balance + 300 where owner = "alice";
us-east-1> update balances set balance = balance - 300 where owner = "bob" and balance > 300;
us-east-1> commit;
Transaction failed. TransactionCanceledException. PLRD70PMCA382RM2S0T2OF4O7VVV4KQNSO5AEMVJF66Q9ASUAAJG. Transaction cancelled, please refer cancellation reasons for specific reasons [None, ConditionalCheckFailed]
ABORT
us-east-1> 

We’re being told that the second write in the transaction had a ConditionalCheckFailed (see the cancellation reasons list: the first entry is None, the second is ConditionalCheckFailed).

Now, we could redo the same thing with a valid amount (less than $200), like this.

us-east-1> select * from balances;
{balance: 200, owner: bob}
{balance: 500, owner: alice}
us-east-1> begin;
us-east-1> update balances set balance = balance + 100 where owner = "alice";
us-east-1> update balances set balance = balance - 100 where owner = "bob" and balance > 100;
us-east-1> commit;
COMMIT
us-east-1> select * from balances;
{balance: 100, owner: bob}
{balance: 600, owner: alice}
us-east-1> 

Now that’s so much more civilized! The WHERE clause “balance > 100” turns into a condition check, and if it fails, so does the transaction.

In DynamoDB, that is all a single API call like so (illustrated using “explain”):

us-east-1> explain begin;
us-east-1> update balances set balance = balance + 100 where owner = "alice";
us-east-1> update balances set balance = balance - 100 where owner = "bob" and balance > 100;
us-east-1> commit;
TransactWriteItems({
   "TransactItems":   [{
         "Update":   {
            "Key":   {
               "owner":   {
                  "S":   "alice"
               }
            },
            "UpdateExpression":   "SET #akaa1 = #akaa1+:vkaa1",
            "TableName":   "balances",
            "ConditionExpression":   "attribute_exists(#akaa2)",
            "ExpressionAttributeNames":   {
               "#akaa1":   "balance",
               "#akaa2":   "owner"
            },
            "ExpressionAttributeValues":   {
               ":vkaa1":   {
                  "N":   "100"
               }
            }
         }
      }, {
         "Update":   {
            "Key":   {
               "owner":   {
                  "S":   "bob"
               }
            },
            "UpdateExpression":   "SET #alaa1 = #alaa1-:vlaa1",
            "TableName":   "balances",
            "ConditionExpression":   "attribute_exists(#alaa2) AND #alaa1 > :vlaa2",
            "ExpressionAttributeNames":   {
               "#alaa1":   "balance",
               "#alaa2":   "owner"
            },
            "ExpressionAttributeValues":   {
               ":vlaa1":   {
                  "N":   "100"
               },
               ":vlaa2":   {
                  "N":   "100"
               }
            }
         }
      }],
   "ClientRequestToken":   "1F087670-87E3-4973-8062-9A0A54E7F200"
})
us-east-1> 

Notice that in addition to the update expression that takes the balance down by $100, there is also a condition check that requires an item to exist that meets the requirement

owner = "bob" and balance > 100

Nifty!
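For completeness, here is roughly what that same transfer might look like from application code. This is a minimal sketch using boto3 (the AWS SDK for Python), not the shell used above; the table, key, and attribute names follow the example, while the transfer() helper and its error handling are my own illustration.

import uuid

import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")


def transfer(from_owner, to_owner, amount):
    """Move `amount` from one balance to another, refusing to overdraw."""
    amt = {"N": str(amount)}
    try:
        ddb.transact_write_items(
            TransactItems=[
                {
                    "Update": {
                        "TableName": "balances",
                        "Key": {"owner": {"S": to_owner}},
                        "UpdateExpression": "SET #bal = #bal + :amt",
                        # The recipient must already exist (mirrors attribute_exists above).
                        "ConditionExpression": "attribute_exists(#owner)",
                        "ExpressionAttributeNames": {"#bal": "balance", "#owner": "owner"},
                        "ExpressionAttributeValues": {":amt": amt},
                    }
                },
                {
                    "Update": {
                        "TableName": "balances",
                        "Key": {"owner": {"S": from_owner}},
                        "UpdateExpression": "SET #bal = #bal - :amt",
                        # The condition check: the debit happens only if the sender
                        # exists and has a balance greater than the amount.
                        "ConditionExpression": "attribute_exists(#owner) AND #bal > :amt",
                        "ExpressionAttributeNames": {"#bal": "balance", "#owner": "owner"},
                        "ExpressionAttributeValues": {":amt": amt},
                    }
                },
            ],
            ClientRequestToken=str(uuid.uuid4()),
        )
        return True
    except ddb.exceptions.TransactionCanceledException as err:
        # CancellationReasons names the failing item, e.g.
        # [{'Code': 'None'}, {'Code': 'ConditionalCheckFailed', ...}]
        print("transfer refused:", err.response.get("CancellationReasons"))
        return False


transfer("bob", "alice", 100)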

In SQL, different databases give you constructs like this (this one is from SQL Server):

UPDATE BALANCES
SET BALANCE = BALANCE - 300
WHERE OWNER = 'Bob' AND BALANCE > 300;

IF @@ROWCOUNT = 0 ...

You’d then be able to decide whether or not you actually updated a row, and take corrective action.
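From application code, that check is typically the DB-API cursor’s rowcount. Here is a minimal sketch in Python; the connection object, the pyodbc-style ? placeholders, and the transfer() helper are assumptions for illustration, with the table and column names taken from the example above.

def transfer(conn, from_owner, to_owner, amount):
    """Debit the sender only if enough money remains; credit the recipient on success."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE BALANCES SET BALANCE = BALANCE - ? "
        "WHERE OWNER = ? AND BALANCE > ?",
        (amount, from_owner, amount),
    )
    if cur.rowcount == 0:
        # The guarded UPDATE matched no row: insufficient funds (or no such account).
        conn.rollback()
        return False
    cur.execute(
        "UPDATE BALANCES SET BALANCE = BALANCE + ? WHERE OWNER = ?",
        (amount, to_owner),
    )
    conn.commit()
    return True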

DynamoDB UpdateItem() provides a condition check expression (even without a transaction).

us-east-1> select * from balances;
{balance: 100, owner: bob}
{balance: 600, owner: alice}
us-east-1> update balances set balance = balance - 200 where owner = "bob" and balance > 200;
UPDATE (0 read, 0 modified, 1 ccf)
us-east-1>

That Update gets transformed into this API call.

us-east-1> explain update balances set balance = balance - 200 where owner = "bob" and balance > 200;
UpdateItem({
   "TableName":   "balances",
   "Key":   {
      "owner":   {
         "S":   "bob"
      }
   },
   "UpdateExpression":   "SET #aoaa1 = #aoaa1-:voaa1",
   "ConditionExpression":   "attribute_exists(#aoaa2) AND #aoaa1 > :voaa2",
   "ExpressionAttributeNames":   {
      "#aoaa1":   "balance",
      "#aoaa2":   "owner"
   },
   "ExpressionAttributeValues":   {
      ":voaa1":   {
         "N":   "200"
      },
      ":voaa2":   {
         "N":   "200"
      }
   }
})
us-east-1> 

Observe that in addition to specifying the Key (owner = bob), it provides a condition expression too!
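From application code, the single-item flavor might look like this with boto3. A failed condition surfaces as a ConditionalCheckFailedException; again, this is a sketch, and the debit() helper is my own naming.

import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")


def debit(owner, amount):
    """Subtract `amount` from `owner`'s balance, but only if enough money remains."""
    try:
        ddb.update_item(
            TableName="balances",
            Key={"owner": {"S": owner}},
            UpdateExpression="SET #bal = #bal - :amt",
            # The WHERE clause, expressed as a condition: the item must exist
            # and must have a balance greater than the amount.
            ConditionExpression="attribute_exists(#owner) AND #bal > :amt",
            ExpressionAttributeNames={"#bal": "balance", "#owner": "owner"},
            ExpressionAttributeValues={":amt": {"N": str(amount)}},
        )
        return True
    except ddb.exceptions.ConditionalCheckFailedException:
        return False  # insufficient funds (or no such account)


debit("bob", 200)  # returns False when bob's balance is only $100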

NoSQL and “single-table” design pattern

The NoSQL “single-table” design pattern appears to be a polarizing topic with strong opinions for and against it. As best as I can tell, there’s no good reason for that!

I did a talk at AWS re:Invent last week along with Alex DeBrie. The talk was about deploying modern and efficient data models with Amazon DynamoDB. One small part of the talk was about the “single-table” design pattern. Over the next couple of days, I was flooded with questions about this pattern. I’m not really sure what all the hoopla is about, and why there is so much passion and almost religious fervor around this topic.

With an RDBMS there are clearly defined benefits and drawbacks to normalization (and denormalization). Normalization and denormalization are an exercise in trading off a well-understood set of redundancies and anomalies against runtime complexity and cost. When you normalize your data, the only mechanism to get it “back together” is a “join”.

If you happen to use a database that doesn’t support joins, or if joins turn out to be expensive, you may prefer to accept the redundancies and anomalies that come with denormalization. This has been a long established pattern, for example in the analytics realm.

The “single-table” design pattern extends traditional RDBMS denormalization in three interesting ways. First, it quite often uses non-atomic datatypes that are not allowed under the normalization rules of Codd, Date, and others. Second, it makes use of the flexible schema support in NoSQL databases to commingle data from different entities in a single table. Finally, it uses data colocation guarantees in NoSQL databases to minimize the number of blocks read, and the number of API calls required, in fetching related data.

Here’s what I think these options look like in practice.

First, this is a normalized schema with three tables. When you want to reconstruct the data, you join the tables. There are primary and foreign key constraints in place to ensure that data is consistent.

The next option is the fully denormalized structure where data from all tables is “pre-joined” into a single table.

The single-table schema is just slightly different. Data for all entities are commingled into a single table.
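To make that concrete, here is a small sketch of what commingled items might look like. The PK/SK naming and the customer/order entities are illustrative assumptions on my part, not a schema from the talk.

# Two entity types stored in one DynamoDB table, distinguished by their sort keys.
# Items for the same customer share a partition key, so a single Query on
# PK = "CUSTOMER#42" returns the customer profile and all of that customer's orders.
items = [
    {"PK": "CUSTOMER#42", "SK": "PROFILE",        "name": "alice", "city": "Boston"},
    {"PK": "CUSTOMER#42", "SK": "ORDER#2023-001", "total": 120, "status": "shipped"},
    {"PK": "CUSTOMER#42", "SK": "ORDER#2023-002", "total": 80, "status": "open"},
]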

Application designers and data modelers should look at their specific use-cases and decide: either eliminate redundancies and inconsistencies and pay the cost of joins (or of performing the join themselves in multiple steps if the database doesn’t support them), or denormalize and benefit from lower complexity and, sometimes, lower cost!

The other thing to keep in mind is that nothing in the single-table design pattern requires you to bring all entities into a single table. A design where some entities are combined into a single table coexists perfectly with others that are not.

What’s all the hoopla about?

Old write-up about CAP Theorem

In 2011, Ken Rugg and I were having a number of conversations around the CAP Theorem and, after much discussion, we came up with the following succinct inequality. It helped us speak much more precisely about what constituted “availability”, “partition tolerance”, and “consistency”. It also confirmed our suspicion that availability and partition tolerance were not simple binary attributes (yes or no), but rather that they had shades of gray.

So here’s the write-up we prepared at that time.

parelastic-brewers-conjecture

Unfortunately, the six part blog post that we wrote (on parelastic.com) never made it in the transition to the new owners.

Is this the end of NoSQL?

If it is, you read it here first!

I posted this article on my other (work related) blog.

I think the future for NoSQL isn’t as bright as a lot of pundits would have you believe. Yes, yes, I know that MongoDB got a $1.2 billion valuation. Some other things to keep in mind:
  1. In the heyday of OODBMS, XML DB, and OLAP/MDX, there was similar hype about those technologies.
  2. Today, more and more NoSQL vendors are trying to build “SQL’isms” into their products. I often hear of people who want a product that has the scalability of NoSQL with transactions and a standard query language. Yes, we have that; it is called a horizontally scalable RDBMS!

Technologies come and technologies go but the underlying trends are worth understanding.

And the trends don’t favor NoSQL.

Why MongoDB and NoSQL make me want to scream

Recently I saw an article on the MongoHQ blog where they described a “slow query” and how to improve its performance.

The problem is this. I have a document with the following fields:

  • ID
  • submitdate
  • status
  • content

And the status can be something like ‘published’, ‘rejected’, ‘in progress’, ‘draft’, etc.

I want to find all articles with some set of statuses and sorted by submit date.
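The query itself is unremarkable. In Python with pymongo it would look something like the sketch below; the connection details and collection name are assumptions, and the field names are from the list above.

from pymongo import MongoClient

db = MongoClient()["blog"]  # connection and database name are illustrative

# Articles in any of the "interesting" statuses, newest submissions first.
for doc in (
    db.articles.find({"status": {"$in": ["draft", "in progress", "rejected"]}})
    .sort("submitdate", -1)
):
    print(doc["_id"], doc["status"], doc["submitdate"])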

Apparently the MongoDB solution to this problem (according to their own blog) is to:

  1. create a new field called ‘unpublished_submit_date’
  2. set that field to a ‘null’ value if the document has an uninteresting status (i.e., published)
  3. set that field to the submitdate if it has an interesting status (i.e., not published)
  4. then query on the single column unpublished_submit_date

Really? Really? You’ve got to be kidding me.

For more on this interesting exchange, a response from a MongoDB fanboy, and a follow-up, read my work blog at

http://parelastic.com/blog/more-subject-improving-performance-removing-query-logic

The things people have to do to use NoSQL boggle the mind!

Scaling MongoDB: A year with MongoDB (Engineering at KiiP)

Here is the synopsis:

  • A year with MongoDB in production
  • Nine months spent in moving 95% of the data off MongoDB and onto PostgreSQL

Over the past 6 months, we’ve “scaled” MongoDB by moving data off of it.

Read the complete article here: http://bit.ly/HIQ8ox

 

SQL, NoSQL, NewSQL and now a new term …

NonsenSQL!

Read all about it at C Mohan’s blog (cmohan.tumblr.com).

Mohan knows a thing or two about databases. As a matter of fact, keeping track of his database related achievements, publications and citations is in itself a big-data problem.

NonsenSQL, read all about it!

 

The MongoDB rant. Truth or hoax?

Two days ago, someone called ‘nomoremongo’ posted this on Y Combinator News.

Several people (me included) stumbled upon the article, read it, and took it at face value. It’s on the Internet, it’s got to be true, right?

No, seriously. I read it, and parts of it resonated with my understanding of how MongoDB works. I saw some of the “warnings” and they seemed real. I read this one (#7) and, ironically, it was the one that convinced me that this was a true post.

**7. Things were shipped that should have never been shipped**

Things with known, embarrassing bugs that could cause data
problems were in "stable" releases--and often we weren't told
about these issues until after they bit us, and then only b/c
we had a super duper crazy platinum support contract with 10gen.

The response was to send up a hot patch and that they were
calling an RC internally, and then run that on our data.

 

Who but a naive engineer would feel this kind of self-righteous outrage? 😉 I’ve shared this outrage at some point in my career, but then I also saw companies ship backup software (and have a party) when they knew that restore couldn’t possibly work (yes, a hot patch), and ship software that could corrupt data in pretty mainstream circumstances (yes, a hot patch before anyone installed it), etc.

I spoke with a couple of people who know MongoDB much better than I do, and they all nodded about some of the things they read. The same article was also forwarded to me by someone who is clearly knowledgeable about MongoDB.

OK, truth has been established.

Then I saw this tweet.

Which was odd. Danny doesn’t usually swear (well, I’ve done things to him that have made him swear and a lot more but that was a long time ago). Right Danny?

 

 

 

Well, he had me at “Start thinking for yourself”. But then he went off the meds: “MongoDB is the next MySQL”, really …

 

I think there’s a kernel of truth in the MongoDB rant. And it is certainly the case that a lot of startups are making dumb architectural decisions because someone told them that “MongoDB was web-scale”, or that “CAP Theorem told them that databases were dead”.

Was this a hoax? I don’t know. But it was certainly a reminder that not all scams originate in Nigeria, and not all of them begin by telling me that I could make a couple of billion dollars if I just put up a couple of thousand.

On migrating from Microsoft SQL Server to MongoDB

I was just reading this article http://www.wireclub.com/development/TqnkQwQ8CxUYTVT90/read describing one company’s experience migrating from SQL Server to MongoDB.

Having read the article, my only question to these folks is “why do it”?

Let’s begin by saying that we should discount all one time costs related to data migration. They are just that, one time migration costs. However monumental, if you believe that the final outcome is going to justify it, grin and bear the cost.

But, once you are in the (promised) MongoDB land, what then?

The things the author believes they will miss are:

  • maturity
  • tools
  • query expressiveness
  • transactions
  • joins
  • case insensitive indexes on text fields

Really? And you would still roll the dice in favor of a NoSQL science project? Well, then the benefits must be really, really awesome! Let’s take a look at what those are:

  • MongoDB is free
  • MongoDB is fast
  • Freedom from rigid schemas
  • ObjectID’s are expressive and handy
  • GridFS for distributed file storage
  • Developed in the open

OK, I’m scratching my head now. None of these really blows me away. Let’s look at these one at a time.

  • MongoDB is free
    • So are PostgreSQL and MySQL
  • MongoDB is fast
    • So are PostgreSQL and MySQL if you put them on the same SSD and multiple HDDs like you claim you do with MongoDB
  • Freedom from rigid schemas
    • I’ll give you this one; relational databases are kind of “old school” in this department
  • ObjectID’s are expressive and handy
    • Elastic Transparent Sharding schemes like ParElastic overcome this with Elastic Sequences, which give you the same benefits. A half-way decent developer could do this for you with a simple sharded architecture.
  • GridFS for distributed file storage
    • Replication, anyone?
  • Developed in the open
    • Yes, MongoDB is free and developed in the open like a puppy is “free”. You just told us all the “costs” associated with this “free puppy”.

So really, why do people use MongoDB? I know there are good circumstances where MongoDB will whip the pants off any relational database, but I submit to you that those are the 1%.

To this day, I believe that the best description of MongoDB is this one:

http://www.xtranormal.com/watch/6995033/mongo-db-is-web-scale

(Video: “Mongo DB is web scale”, by gar1t)

Database scalability myth (again)

A common myth that has been perpetuated is that relational databases do not scale beyond two or three nodes. That, and the CAP Theorem, are considered to be the reasons why relational databases are unscalable and why NoSQL is the only feasible solution!

Yesterday I ran into a very thought-provoking article that makes just this case. You can read the entire post here. In it, the author, Srinath Perera, provides an interesting template for choosing the data store for an application and makes the case that relational databases do not scale beyond 2 to 5 nodes. He writes,

The low scalability class roughly denotes the limits of RDBMS where they can be scaled by adding few replicas. However, data synchronization is expensive and usually RDBMSs do not scale for more than 2-5 nodes. The “Scalable” class roughly denotes data sharded (partitioned) across many nodes, and high scalability means ultra scalable systems like Google.

In 2002, when I started at Netezza, the first system I worked on (affectionately called Monolith) had almost 100 nodes. The first production-class “Mercury” system had 108 nodes (112 nodes, 4 spares). By 2006, the systems had over 650 nodes, and more recently much larger systems have been put into production. Yet, people still believe that relational databases don’t scale beyond two or three nodes!

Systems like ParElastic (Elastic Transparent Sharding) can certainly scale to many more than two or three nodes, and I’ve run prototype systems with up to 100 nodes on Amazon EC2!

Srinath’s post does contain an interesting perspective on unstructured and semi-structured data though, one that I think most will generally agree with.