Sunset in Rye, New Hampshire

I got two interesting pictures of the sunset in Rye, NH today. Each picture was made by stitching together five images with AutoStitch.

The two pictures were taken five minutes apart, and it is interesting how much the colors changed in that five-minute interval.

And about five minutes later

I have reduced these images to 70% of their original size so that they display properly on the blog page. Larger versions are on Flickr (use the left panel).

Look Ma! NoSQL!

More musings on NoSQL, prompted by a blog post I read: “NoSQL: If Only It Was That Easy”

There has definitely been more chatter about NoSQL in the Boston area lately. I hear there is a group forming around NoSQL (I will post more details when I get them). There were some NoSQL folks at the recent Cloud Camp, which I was not able to attend (damn!).

My views on NoSQL are unchanged from an earlier post on the subject. I think there are some genuine issues with database scaling that are being addressed through a variety of avenues (packages, tools, …). But, in the end, the reason SQL has survived for so long is that it is a descriptive language that is reasonably portable. That is also why, in the data warehousing space, each vendor is going off and building non-SQL extensions in a totally non-portable way. And they are all going to, IMHO, have to reconcile their differences before those features get wide mainstream adoption.
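
To make that portability point concrete, here is a minimal sketch in Python (my own toy example, using the standard sqlite3 module and a made-up sales table): the declarative query text doesn’t care which engine executes it, so moving from SQLite to PostgreSQL or MySQL is mostly a matter of swapping the connection, not rewriting the query.

```python
import sqlite3

# The declarative query is engine-agnostic; only the connection below is
# SQLite-specific. Swap the connection for a PostgreSQL or MySQL driver and
# the SQL text stays the same -- that portability is the point.
QUERY = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

for region, total in conn.execute(QUERY):
    print(region, total)
```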

This morning I read a well-researched blog post by BJ Clark, by way of Hacker News. (If you don’t use HN, you should definitely give it a try.)

I strongly recommend that if you are interested in NoSQL, you read the conclusion section carefully. I have annotated the conclusion section below.

“NoSQL is a great tool, but it’s certainly not going to be your competitive edge, it’s not going to make your app hot, and most of all, your users won’t give a shit about any of this.

What am I going to build my next app on? Probably Postgres.

Will I use NoSQL? Maybe. [I would not, but that may just be my bias]

I might keep everything in flat files. [Yikes! If I had to do this, I’d consider something like MySQL CSV first]


If I need reporting, I won’t be using any NoSQL.

If I need ACIDity, I won’t use NoSQL.

If I need transactions, I’ll use Postgres.

…”

NoSQL is a great stepping stone; what comes next will be really exciting technology. If what we need is a database that scales, let’s go make ourselves a database that scales. Base it on MySQL, PostgreSQL, … but please make it SQL-based. Extend SQL if you have to. I really do like being able to coexist with the rich ecosystem of visualization tools, reporting tools, dashboards, … you get the picture.


On over-engineered solutions

The human desire to over-engineer solutions

On a recent trip, I had way too much time on my hands and was ambling around the usual overpriced shops in the airport terminal. There I saw a 40th anniversary “special edition” of the Space Pen. For $799.99 (plus taxes) you could get one of these pens and some other memorabilia to commemorate the 40th anniversary of the first lunar landing. If you aren’t familiar with the Space Pen, you can learn more at the web site of the Fisher Space Pen Co.

Fisher Space Pen

For $800 you could get the AG7-40LE – Astronaut Space Pen 40th Year Moon Landing Celebration Commemorative Pen & Box.

Part of this pen actually circled the moon!

Salient features of this revolutionary device:

  • it writes upside down
  • it writes in any language (so the Russians bought some of these :))
  • it draws pictures
  • it writes in zero gravity
  • it writes under water
  • it writes on greasy surfaces
  • it fixes broken switches on lunar modules

If you want to know about the history of this device, you can look at NASA’s web page here. On the web site of the Fisher Space Pen Co, you can also see the promotion of the AG7-40 – Astronaut Space Pen 40th Year Moon Landing Celebration Engraving.

While in the airport, I saw a couple of security officers rolling around on Segway Personal Transporters. Did you know that for approximately $10,000 you could get yourself a Segway PT i2 Ferrari Limited Edition? I have no idea how much the airport paid for them, but the Segway i2 is available on Amazon for about $6,000. It did strike me as silly, until I noticed three (considerably more athletic) officers riding through the terminal on Trek bicycles. That seemed a lot more reasonable. I have a bicycle like that, and it costs maybe 10% of what a Segway does.

I got to thinking about the rampant over-engineering all around me, and happened upon this web page when I did a search for Segway!

How to render the Segway Human Transporter Obsolete by Maddox.

Who would have thought of it: just add a third wheel and you could have a vehicle just as revolutionary. I thought about it some more and figured that the third wheel in Maddox’s picture is probably not the best choice; maybe it should be more like the track-ball of a mouse. That would have no resistance to turning, and the contraption could do an effortless zero-radius turn if required. The ball could be spring-loaded, and the whole thing could be mounted on an arm with some form of shock-absorbing mechanism.

And if we had done that, we would not have seen this: Bush Fails Segway Test! As an interesting aside, did you know that the high-tech heart stent used for people who have bypass surgery was also invented by Dean Kamen?

We have a million ways to solve most problems. Why then do we over-engineer solutions to the point where the solution introduces more problems?

Keep it simple! There are fewer things that can break later.

Oh, and about that broken switch in the lunar module: Buzz Aldrin has killed that wonderful urban legend by saying he used a felt-tip pen.

Buzz Aldrin still has the felt-tipped pen he used as a makeshift switch needed to fire up the engines that lifted him and fellow Apollo 11 astronaut Neil Armstrong off the moon and started their safe return to Earth nearly 40 years ago.

“The pen and the circuit breaker (switch) are in my possession because we do get a few memorabilia to kind of symbolize things that happened,” Aldrin told reporters Friday.

If Buzz Aldrin used a felt-tip pen, why do we need a Space Pen? And what exactly are we celebrating by buying an $800 pen that can’t even fix a broken switch? A pencil can write upside down (and also write in any language :)). Why do we need Segway Human Transporters in an airport when most security officers should be able to walk or ride a bicycle? Why do we build complex software products and make them bloated, unusable, incomprehensible and expensive?

That’s simple: we’re paying homage to our overpowering desire to over-engineer solutions.

XtremeData’s latest announcement

A short summary of the dbX announcement by XtremeData, and ISAs in data warehouse appliances

XtremeData has announced their entry into the data warehouse appliance space. Called the dbX, this product is an interesting twist on the use of FPGAs (Field Programmable Gate Arrays). Unlike the other data warehouse appliance companies, XtremeData is not a data warehouse appliance startup. They are a six-year-old company with an established product line and a patent for the In-Socket Accelerator that is part of the dbX.

What is an ISA?

The core technology that XtremeData makes is a device called an In-Socket Accelerator (ISA). Occupying a CPU or co-processor socket on a motherboard, this plug-compatible device emulates a CPU using an FPGA. But the ISA goes beyond the functionality provided by the CPU that it replaces.

XtremeData’s patent (US Patent 7454550, “Systems and methods for providing co-processors to computing systems”, issued in 2008) includes the following description:

It is sometimes desirable to provide additional functionality/capability or performance to a computer system and thus, motherboards are provided with means for receiving additional devices, typically by way of “expansion slots.” Devices added to the motherboard by these expansion slots communicate via a standard bus. The expansion slots and bus are designed to receive and provide data transport of a wide array of devices, but have well-known design limitations.

One type of enhancement of a computer system involves the addition of a co-processor. The challenges of using a co-processor with an existing computer system include the provision of physical space to add the co-processor, providing power to the co-processor, providing memory for the co-processor, dissipating the additional heat generated by the co-processor and providing a high-speed pipe for information to and from the co-processor.

Without replacing the socket, which would require replacing the motherboard, the CPU cannot be presently changed to one for which the socket was not designed, which might be desirable in providing features, functionality, performance and capabilities for which the system was not originally designed.

and

FPGA accelerators and the like, including counterparts therefore, e.g., application-specific integrated circuits (ASIC), are well known in the high-performance computing field. Nallatech, (see http://www.nallatech.com) is an example of several vendors that offer FPGA accelerator boards that can be plugged into standard computer systems. These boards are built to conform to industry standard I/O (Input/Output) interfaces for plug-in boards. Examples of such industry standards include: PCI (Peripheral Component Interconnect) and its derivatives such as PCI-X and PCI-Express. Some computer system vendors, for example, Cray, Inc., (see http://www.cray.com) offer built-in FPGA co-processors interfaced via proprietary interfaces to the CPU.

What else does XtremeData do?

Their core product is the In-Socket Accelerator, with applications in Financial Services, Medical, Military, Broadcast, High Performance Computing and Wireless (source: company web page). The dbX is their latest venture, integrating their ISA into a standard HP server and providing “SQL on a chip”.

How does this compare with Netezza?

The currently shipping systems from Netezza also include an FPGA but, as described in their patent (US patent 709231000, Field oriented pipeline architecture for a programmable data streaming processor), the FPGA sits between a disk drive and a processor (as a programmable data streaming processor). In the interest of full disclosure, I must also point out that I worked at Netezza and on the product described here. All descriptions in this blog are strictly based on publicly available information, and references are provided.

XtremeData, on the other hand, uses the FPGA in an ISA and emulates the processor.

So, the two are very different architectures, and accomplish very different things.

Other fun things that can be done with this technology 🙂

Similar technology was used to make a PDP-10 processor emulator; you can read more about that project here. Using a Xilinx FPGA, the folks behind that project were able to build a completely functional PDP-10 processor.

Or, if you want a new Commodore 64 processor, you can read more about that project here.

Why are these links relevant? ISAs and processor emulators seem like black magic. After all, Intel and AMD spend tons of money, and their engineers spend a huge amount of time and resources, to design and build CPUs. It may seem outlandish to claim that one can reimplement a CPU using a device that one analyst called “Field Programmable Gatorade”. These two projects give you an idea of what people can do with this Gatorade thing to make a processor.

And if you think I’m making up the part about Gatorade, look here and here (search for Gatorade, it is a long article).

About Xtremedata

XtremeData, Inc. creates hardware accelerated Database Analytics Appliances and is the inventor and leader in FPGA-based In-Socket Accelerators (ISA). The company offers many different appliances and FPGA-based ISA solutions for markets such as Decision Support Systems, Financial Analytics, Video Transcoding, Life Sciences, Military, and Wireless. Founded in 2003, XtremeData has established two centers: headquarters are located in Schaumburg, IL (near Chicago, Illinois, USA) and a software development location in Bangalore, India.

XtremeData, Inc. is a privately held company.

source: company web page

My Opinion

This is a technology that I have been watching for some time now. It will be interesting to see how this approach compares (price, performance, completeness) with offerings from vendors who are already in the field. Because they have adopted industry standard hardware (HP), are a certified vendor under HP’s accelerator program, and have other uses for these accelerators beyond the dbX, this promises to be an interesting development in the market.

Wondering about “Shared Nothing”

Is “Shared Nothing” the best architecture for a database? Is Stonebraker’s assertion of 1986 still valid? Were Norman et al. correct in 1996 when they made a case for a hybrid system? Have recent developments changed the dynamics?

Over the past several months, I have been wondering about the Shared Nothing architecture that seems to be very common in parallel and massively parallel systems (specifically for databases) these days. With the advent of cheap servers and fast interconnects, a considerable amount of literature points to an apparent “consensus” that Shared Nothing is the preferred architecture for parallel databases. One such example is Stonebraker’s 1986 paper (Michael Stonebraker. The case for shared nothing. Database Engineering, 9:4–9, 1986). A reference to this apparent “consensus” can be found in a paper by DeWitt and Gray (DeWitt and Gray. Parallel database systems: the future of high performance database systems. 1992).

A consensus on parallel and distributed database system architecture has emerged. This architecture is based on a shared-nothing hardware design [STON86] in which processors communicate with one another only by sending messages via an interconnection network.
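
To illustrate what that consensus means in practice, here is a minimal, hypothetical sketch in Python (my own toy example, not taken from any of the papers cited here): each “node” owns its private partition of the data, computes a partial aggregate locally, and only those small partial results travel over the “interconnect” to a coordinator that merges them.

```python
from multiprocessing import Pool

# Each "node" owns its own partition; nothing (disk, memory) is shared.
PARTITIONS = [
    [("east", 100), ("west", 250)],   # node 0's local data
    [("east", 75), ("north", 40)],    # node 1's local data
    [("west", 10), ("north", 90)],    # node 2's local data
]

def local_aggregate(partition):
    """Runs on one node, against only that node's local partition."""
    totals = {}
    for region, amount in partition:
        totals[region] = totals.get(region, 0) + amount
    return totals  # the only thing sent back over the "interconnect"

def merge(partials):
    """The coordinator merges the small per-node partial results."""
    final = {}
    for part in partials:
        for region, amount in part.items():
            final[region] = final.get(region, 0) + amount
    return final

if __name__ == "__main__":
    with Pool(len(PARTITIONS)) as pool:
        print(merge(pool.map(local_aggregate, PARTITIONS)))
        # {'east': 175, 'west': 260, 'north': 130}
```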

But two decades or so later, is this still the case? Is “Shared Nothing” really the way to go? Are there other, better options that one should consider?

As long ago as 1996, Norman, Zurek and Thanisch (Norman, Zurek and Thanisch. Much ado about shared nothing. 1996) made a compelling argument for hybrid systems. But even that was over a decade ago, and the last decade has seen some rather interesting changes. Is the argument proposed in 1996 by Norman et al. still valid? (Updated 2009-08-02: a related 1995 article in Computergram can be read at http://www.cbronline.com/news/shared_nothing_parallel_database_architecture)

With the advent of clouds and virtualization, doesn’t one need to seriously reconsider the shared nothing architecture? Is a shared disk architecture in a cloud even a worthwhile conversation to have? I was reading the other day about shared nothing cluster storage. What? That seems like a contradiction, doesn’t it?

Some interesting questions to ponder:

  1. In the current state of technology, are the characterizations “Shared Everything”, “Shared Memory”, “Shared Disk” and “Shared Nothing” still sufficient? Do we need additional categories?
  2. Is Shared Nothing the best way to go (as advocated by Stonebraker in 1986), or is a hybrid system the best way to go (as advocated by Norman et al. in 1996) for a high performance database?
  3. What one or two significant technology changes could have a major impact on the answer?

I’ve been pondering these questions for some time now and can’t quite seem to decide which way to lean. I am convinced that the answer to #1 is that we need additional categories, based on advances in virtualization. But I am uncertain about the choice between a hybrid system and a Shared Nothing system. The trajectory of virtualization, clouds and related technologies seems to indicate that Shared Nothing is the way to go; but Norman and others make a compelling case for a hybrid.

What do you think?

Michael Stonebraker. The case for shared nothing. Database Engineering, 9:4–9, 1986

Wow, the TPC-H speculation continues!

The real technology behind the ParAccel TPC-H results is revealed here!

It is definitely interesting to see that the ParAccel TPC-H saga is not yet done. Posts by Daniel Abadi (an interesting analysis, though it seems simplistic at first blush) and reflections on Curt Monash’s blog are proving to be amusing, to say the least.

At the end of the day, whether ParAccel is a “true column store” (or a “vertically partitioned row store”), and whether it has too much disk capacity and disk bandwidth strike me as somewhat academic arguments that don’t recognize some basic facts.

  • ParAccel’s system (bloated, oversized, undersized, truly columnar, OR NOT) is only the second system to post 30TB results.
  • That system costs less than half the price of the system behind the other published 30TB result. Since I said “basic facts”, I will point out that the other entry may provide higher resiliency, and that may be part of the higher price. I have not researched whether the additional price is justified or related to the higher resiliency; I simply want to make the point that there is a difference in the stated resiliency of the two solutions that is not apparent from the performance claims.

It might just be my sheltered upbringing but:

  • I know of few customers who purchase cars solely for the manufacturer’s advertised MPG and, similarly, I know of no one who purchases a data warehouse from a specific vendor because of the published TPC-H results.
  • I know of few customers who make a purchase decision based on whether a car has an inline engine, a V-engine or a rotary engine and, similarly, I know of no one who makes a buying decision on a data warehouse based on whether the technology is “truly columnar”, a “vertically partitioned row store”, a “row store with all indexes” or some other esoteric collection of interesting-sounding words.

So, I ask you, why are so many people so stewed about ParAccel’s TPC-H numbers? In a few more days, will we have more posts about ParAccel’s TPC-H numbers than we will about whether Michael Jackson really died or whether this is all part of a “comeback tour”?

I can’t answer either of those questions for sure, but what I do know is this: ParAccel is getting some great publicity out of this whole thing.

And I have it on good authority (it came to me in a dream last night) that ParAccel’s solution is based on high-performance trilithium crystals. (Note: I don’t know why this wasn’t disclosed in the full disclosure report.) I hear that they chose 43 nodes because someone misremembered the “universal answer” from The Hitchhiker’s Guide to the Galaxy. By the time someone realized this, it was too late because the data load had begun. Remember, you read it here first 🙂

Give it a rest folks!

P.S. Within minutes of posting this, a well-known heckler sent me email with the following explanation, which confirms my hypothesis.

When a beam of matter and antimatter collide in dilithium we get a plasma field that powers warp drives within the “Sun” workstations. The warp drives that ParAccel uses are the Q-Warp variant which allow queries to run faster than the speed of light. A patent has been filed for this technique, don’t mention it in your blog please.

Not so fast, maybe relational databases aren’t dead!

Maybe the obituary announcing the demise of the relational database was premature!

Much has been written recently about the demise (or in some cases, the impending demise) of the relational database. “Relational databases are dead,” writes Savio Rodrigues on July 2nd. I guess I missed the announcement and the funeral in the flood of emails and tweets about another high-profile demise.

Some days ago, Michael Stonebraker authored an article with the title “The End of a DBMS Era (Might be Upon Us)”. He made a similar argument in this article in September 2007, and also in this 2005 paper with Uğur Çetintemel.

What Michael says here is absolutely true. And, in reality, Savio’s article just has a catchy title (and it worked). The body of the article makes a valid argument: there are some situations where the current “one size fits all” relational database offering, born in the OLTP days, may not be adequate for all data management problems.

So, let’s be perfectly clear about this: the issue isn’t that relational databases are dead. It is that a variety of use cases are pushing the current relational database offerings to their limits.

I must emphasize that I consider relational databases (RDBMSs) to be those systems that use a relational model (a definition consistent with http://en.wikipedia.org/wiki/Relational_database). As a result, columnar (or vertical) representations, row (or horizontal) representations, and systems with hardware acceleration (FPGAs, …) are all relational databases. There is arguably some confusion in terminology in the rest of this post, especially where I quote others who use the term “Relational Database” more narrowly, so as to create a perception of differentiation between their product (columnar, analytic, …) and the conventional row-oriented database, which they refer to as an RDBMS.

Tony Bain begins his three-part series about the problem with relational databases with an introduction in which he says:

“The specialist solutions have been slowly cropping up over the last 5 years and now today it wouldn’t be that unusual for an organization to choose a specialist data analytics database platform (such as those offered from Netezza, Greenplum, Vertica, Aster Data or Kickfire) over a generic database platform offered by IBM, Microsoft, Oracle or Sun for housing data for high end analytics.”

While I have some issues with his characterization of “specialist analytic database platforms” as something other than relational databases, I assume that he is using the term RDBMS to refer to the commonly available (general purpose) databases that are most often seen in OLTP environments.

I believe that whether you use a column-oriented architecture (with or without compression), an architecture that uses hardware acceleration (Kickfire, Netezza, …) or a materialized view, you are attempting to address the same underlying issue: I/O is costly, and performance improves significantly when you reduce the I/O cost. Columnar representations significantly reduce I/O cost by not performing DMA on unnecessary columns of data. FPGAs in Netezza serve a similar purpose; (among other things) they perform projections, thereby reducing the amount of data that is DMA’ed. A materialized view with only the required columns (a narrow, thin table) serves the same purpose. In a similar manner (but for different reasons), indexes improve performance by quickly identifying the tuples that need to be DMA’ed.
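
As a toy illustration of that point (my own sketch in Python, with in-memory lists standing in for disk pages), a column-oriented layout lets a query that touches one column read just that column, while a row-oriented layout drags every complete row past the query:

```python
import random

NUM_ROWS, NUM_COLS = 100_000, 20
random.seed(1)

# Row store: every row carries all twenty column values, so a scan of one
# column still walks the full width of each row.
row_store = [[random.random() for _ in range(NUM_COLS)] for _ in range(NUM_ROWS)]

# Column store: each column is stored contiguously, so a scan of column 3
# never touches the other nineteen columns.
column_store = [[row[c] for row in row_store] for c in range(NUM_COLS)]

def sum_column_from_rows(rows, col):
    return sum(row[col] for row in rows)    # drags whole rows along

def sum_column_from_columns(cols, col):
    return sum(cols[col])                   # reads only what the query needs

assert sum_column_from_rows(row_store, 3) == sum_column_from_columns(column_store, 3)
```

In a real engine the saving shows up as disk and memory bandwidth rather than Python list traversals, but the shape of the argument is the same.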

Notice that all of these solutions fundamentally address one aspect of the problem: how to reduce the cost of I/O. The challenges facing databases these days are somewhat different. In addition to the huge amounts of data being amassed (see the Richard Winter article on the subject), there is a much broader variety of things being demanded of the repository of that information. For example, there is the “Search” model that has been discussed in a variety of contexts (web, peptide/nucleotide), and the stream processing and data warehousing cases that have also received a fair amount of discussion.

Unlike the problem of I/O cost, many of these problems reflect issues with the fundamental structure and semantics of the SQL language. Some of these issues can be addressed with language extensions, User Defined Functions, MapReduce extensions and the like. But none of these address the underlying issue: the language and semantics were defined for a class of problems that we have today come to classify as the “OLTP use case”.
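
As a rough, hypothetical sketch of what those extensions look like in spirit (my own Python example, not any vendor’s actual API): a user-defined function is applied to each row where the data lives (the “map”), and the per-row results are then combined into an answer (the “reduce”).

```python
from functools import reduce

# A toy "table" of rows the engine is assumed to hold.
rows = [
    {"user": "a", "bytes": 120},
    {"user": "b", "bytes": 300},
    {"user": "a", "bytes": 80},
]

# A user-defined function expressing logic that plain SQL handles awkwardly.
def classify(row):
    return ("heavy" if row["bytes"] > 100 else "light", row["bytes"])

# "Map" step: apply the UDF to every row, where the data lives.
mapped = map(classify, rows)

# "Reduce" step: combine the per-row results into a single answer.
def combine(acc, item):
    label, nbytes = item
    acc[label] = acc.get(label, 0) + nbytes
    return acc

print(reduce(combine, mapped, {}))   # {'heavy': 420, 'light': 80}
```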

Relational databases are not dead; on the contrary, with the huge amounts of information being handled, they are more alive than ever before. The SQL language is not dead, but it is in need of some improvements. That’s not something new; we’ve seen those in ’92, ’99, … More importantly, the reason the Relational Database and SQL have survived this long is that they are widely used and portable. By being an extensible and descriptive language, SQL has managed to adapt to many of the new requirements placed on it.

And if the current problems are significant, two more problems are just around the corner, waiting to rear their ugly heads. The first is the widespread adoption of virtualization and the abstraction of computing resources. In addition to making it much harder to adopt solutions with custom hardware (which cannot be virtualized), it introduces a level of unpredictability in I/O bandwidth, latency and performance. Right along with this, users are going to want the database to live in the cloud, and with that will come all the requirements of scalability, ease of use and deployment that one associates with a cloud based offering (not just the deployment model). The second is the fact that users will expect one “solution” to meet a wide variety of demands, from the current OLTP and reporting through the real-time alerting that today’s “Google/Facebook/Twitter generation” has come to demand (look-ma-no-silos).

These problems are going to drive a round of innovation, and the NoSQL trend is a good and healthy trend. In any description of the NoSQL and analytics alternatives, one should also mention the various vendors who are working on CEP solutions. As a result of all of these efforts, relational databases as we know them today (general purpose, OLTP-optimized, small-data-volume systems) will evolve into systems capable of managing huge volumes of data in a distributed/cloud/virtualized environment and of meeting a broad variety of consumer demands.

The current architectures that we know of (shared disk, shared nothing, shared memory) will need to be reconsidered in a virtualized environment. The architectures of our current databases will also need changes to address the wide variety of consumer demands. Current optimization techniques will need to be adapted, and the underlying data representations will have to change. But, in the end, I believe that the thing that will decide the success or failure of a technology in this area is the extent of compatibility and integration with the existing SQL language. If a system has a whole new set of semantics and is fundamentally incompatible with SQL, I believe adoption will be slow. A system that extends SQL and meets these new requirements will do much better.

Relational databases aren’t dead, but the “one-size-fits-all” model is certainly on shaky ground! There is a convergence between the virtualization/cloud paradigms, the cost and convenience advantages of managing large infrastructures in that model, and the business need for large databases.

Fasten your seat-belts because the ride will be rough. But, it is a great time to be in the big-data-management field!