A Report from Boston’s First “Big Data Summit”

A short write-up about last night’s Big Data Summit appeared on Xconomy today.

My thanks to our sponsors: Foley Hoag LLP and the wonderful team at the Emerging Enterprise Center, Infobright, Expressor Software, and Kalido.

Boston Big Data Summit Kickoff, October 22nd 2009

Since the announcement of the Boston Big Data Summit on the 2nd of October, we have had a fantastic response. The event sold out two days ago. We figured that we could remove the tables from the room and accommodate more people, and we sold out again.

If you have registered but you are not going to be able to attend, please contact me and we will make sure that someone on the waiting list is confirmed.

There has been some question about what “Big Data” is. Curt Monash, who will be delivering the keynote and moderating the discussion at next week’s event, writes:

… where “Big Data” evidently is to be construed as anything from a few terabytes on up. (Things are smaller in the Northeast than in California …)

When you catch a fish (whether it is the little fish on the left or the bigger fish on the right), the steps to prepare it for the table are surprisingly similar. You may have more work to do with the big fish, and you may use different tools to do it, but the steps are the same.

So, while size influences the situation, it isn’t only about the size!

In my opinion, whether data is “Big” or not is more of a threshold discussion. Data is “Big” if the tools and techniques being used to acquire, cleanse, pre-process, store, process, and archive it are either unable to keep up or are not cost effective.

Yes, everything is bigger in California, even the size of the mess they are in. Now, that is truly a “Big Problem”!

The 50,000-row spreadsheet, the half terabyte of data in SQL Server, and the 1-trillion-row table on a large ADBMS are all, in their own ways, “Big Data” problems.

The user with 50k rows in Excel may not want (or be able to afford) a solution with a “real database”, and may resort to splitting the spreadsheet into two sheets. The user with half a terabyte of SQL Server or MySQL data may adopt some home-grown partitioning or sharding technique instead of upgrading to a bigger platform, and the user with a trillion CDRs may reduce the retention period; but they are all responding to the same basic challenge of “Big Data”.
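To make that concrete, here is a minimal sketch of the kind of home-grown hash sharding I have in mind, written in Python with SQLite standing in for the real server. The table layout, shard count, and helper names are all hypothetical, not anything a particular user actually built:

```python
import sqlite3

NUM_SHARDS = 4  # hypothetical; a real setup would size this to the data

def shard_for(customer_id):
    # Route a row to one of several physical tables by hashing its key.
    return f"orders_{customer_id % NUM_SHARDS}"

conn = sqlite3.connect(":memory:")  # SQLite stands in for the real server
for i in range(NUM_SHARDS):
    conn.execute(f"CREATE TABLE orders_{i} (customer_id INTEGER, amount REAL)")

def insert_order(customer_id, amount):
    conn.execute(
        f"INSERT INTO {shard_for(customer_id)} (customer_id, amount) VALUES (?, ?)",
        (customer_id, amount),
    )

def total_for_customer(customer_id):
    # A point query touches exactly one shard; any query that spans
    # customers must fan out over all NUM_SHARDS tables, which is
    # where this kind of workaround starts to hurt.
    row = conn.execute(
        f"SELECT COALESCE(SUM(amount), 0) FROM {shard_for(customer_id)} "
        "WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]

insert_order(7, 19.99)
insert_order(11, 5.00)
print(total_for_customer(7))  # 19.99
```

It works, after a fashion, but every cross-shard question now belongs to the application instead of the database; that trade-off is the common thread in all three situations above.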

We now have three panelists.

It promises to be a fun evening.

I have some thoughts on subjects for the next meeting; if you have ideas, please post a comment here.

On MapReduce and Relational Databases – Part 1

Describes MapReduce and why WOTS (Wart-On-The-Side) MapReduce is bad for databases.

This is the first of a two-part blog post that presents a perspective on the recent trend to integrate MapReduce with relational databases, especially Analytic Database Management Systems (ADBMS).

The first part of this blog post provides an introduction to MapReduce, gives a short account of its history and why it was created, and describes the stated benefits of MapReduce.
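For readers who have not seen the model before, word count is the canonical example. The sketch below, in Python, runs the map, shuffle, and reduce phases in a single process purely for illustration; a real MapReduce framework distributes each phase across many machines, and the function names here are mine, not any framework’s API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: collapse all values emitted for a single key.
    return key, sum(values)

def mapreduce(documents):
    # Shuffle: group intermediate pairs by key; in a real framework this
    # grouping happens across the network between mappers and reducers.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(d) for d in documents):
        groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# -> {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```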

The second part of this blog post explains why I believe that integrating MapReduce with relational databases is a significant mistake. It concludes with some alternatives that would provide much better solutions to the problems that MapReduce is supposed to solve.
Continue reading “On MapReduce and Relational Databases – Part 1”

On MapReduce and Relational Databases – Part 2

This is the second of a two-part blog post that presents a perspective on the recent trend to integrate MapReduce with relational databases, especially Analytic Database Management Systems (ADBMS).

The first part of this blog post provides an introduction to MapReduce, gives a short account of its history and why it was created, and describes the stated benefits of MapReduce.

The second part of this blog post explains why I believe that integrating MapReduce with relational databases is a significant mistake. It concludes with some alternatives that would provide much better solutions to the problems that MapReduce is supposed to solve.
Continue reading “On MapReduce and Relational Databases – Part 2”

Announcing the Boston Big Data Summit

Announcement for the kickoff of the Boston “Big Data Summit”. The event will be held on Thursday, October 22nd 2009 at 6pm at the Emerging Enterprise Center at Foley Hoag in Waltham, MA. Register at http://bigdata102209.eventbrite.com


The Boston “Big Data Summit” will be holding its first meeting on Thursday, October 22nd 2009 at 6pm at the Emerging Enterprise Center at Foley Hoag in Waltham, MA.

The Boston area is home to a large number of companies involved in the collection, storage, analysis, integration, quality, master data management, and archival of “Big Data”. If you are involved in any of these, then the meeting of the Boston “Big Data Summit” is something you should plan to attend. Save the date!

The first meeting of the group will feature a discussion of “Big Data” and the challenges of “Big Data” analysis in the cloud.

Over 120 people had signed up as of October 14th, 2009.

There is a waiting list. If you are registered and won’t be able to attend, please contact me so we can allow someone on the wait list to attend instead.

Seating is limited, so go online and register for the event at http://bigdata102209.eventbrite.com.

The Boston “Big Data Summit” thanks the Emerging Enterprise Center at Foley Hoag LLP for their support and assistance in organizing this event.

Agenda Updated

The Boston “Big Data Summit” is being sponsored by Foley Hoag LLP, Infobright, Expressor Software, and Kalido.

For more information about the Boston “Big Data Summit”, please contact the group administrator at boston.bigdata@gmail.com.


The Boston Big Data Summit is organized by Bob Zurek and Amrith (me) in partnership with the Emerging Enterprise Center at Foley Hoag LLP.


Faster or Free

I don’t know how Bruce Scott’s article showed up in my mailbox, but I’m confused by it (that happens a lot these days).

I agree with him that too much has been made of whether a system is a columnar system, a truly columnar system, or a vertically partitioned row store, and that what really matters to a customer is TCO and price-performance in their own environment. Bruce says as much in his blog post:

Let’s start talking about what customers really care about: price-performance and low cost of ownership. Customers want to do more with less. They want less initial cost and less ongoing cost.

Then he goes on to say:

On this last point, we have found that we always outperform our competitors in customer created benchmarks, especially when there are previously unforeseen queries. Due to customer confidentiality this can appear to be a hollow claim that we cannot always publicly back up with customer testimonials. Because of this, we’ve decided to put our money where our mouth is in our “Faster or Free” offer. Check out our website for details here: http://www.paraccel.com/faster_or_free.php

So, I went and looked at that link. There, it says:

Our promise: The ParAccel Analytic Database™ will be faster than your current database management system or any that you are currently evaluating, or our software license is free (Maintenance is not included. Requires an executed evaluation agreement.)

To be consistent, should the promise not be that the ParAccel offering will provide better price-performance and lower TCO than the current system or the one being evaluated? After all, that is what customers really care about.

I’m confused. More coffee!

Oh, there’s more! Check out this link: http://www.paraccel.com/cash_for_clunkers.php

Talk about fine print:

* Trade-in value is equivalent to the first year free of a three year subscription contract based on an annual subscription rate of $15K/user terabyte of data. Servers are purchased separately.

ParAccel TPC-H 30TB results challenged!

Watch the feeding frenzy now that ParAccel’s TPC-H 30TB results have been challenged.

Before I had my morning cup of coffee, I found an email message with the subject “ParAccel ADVISORY” sitting in my mailbox. Now, I’m not exactly sure why I got this message from Renee Deger of GlobalFluency, so my first suspicion was that this was a scam and that someone in Nigeria would be giving me a million dollars if I did something.

But I was disappointed. Renee Deger is not a Nigerian bank tycoon who will make me rich. In fact, ParAccel’s own blog indicates that their 30TB results have been challenged.

We wanted you to hear it from us first.  Our TPC-H Benchmark for performance and price-performance at the 30-terabyte scale was taken down following a challenge by one of our competitors and a review by the TPC.  We executed this benchmark in collaboration with Sun and under the watch of a highly qualified and experienced auditor.   Although it has been around for many years, the TPC-H specification is still subject to interpretation, and our interpretation of some of the points raised was not accepted by the TPC review board.

None of these items impacts our actual performance, which is still the best in the industry.  We will rerun the benchmark at our earliest opportunity according to the interpretations set forth by the TPC review board. We remain committed to the organization, as outlined in a blog post by Barry Zane, our CTO, here: http://paraccel.com/data_warehouse_blog/?p=74#more-74.

Please see the company blog for a post by David Steinhoff in the office of the CTO for further info: http://paraccel.com/data_warehouse_blog/?p=104#more-104

I read David Steinhoff’s blog as well. He writes:

This last week, our June 2009 30TB results were challenged and found to be in violation of certain TPC rules. Accordingly, our results will soon be taken down from the TPC website.

We published our results in good faith. We used our standard, customer available database to run the benchmark (we wanted the benchmark to reflect the incredible performance our customers receive). However, the body of TPC rules is complex and undergoes constant interpretation; we are still relatively new to the benchmark game and are still learning, and we made some mistakes.

While we cannot get into the details of the challenges to our results (TPC proceedings are confidential and we would be in violation if we did), we can state with confidence that our query performance was in no way enhanced by the items that were challenged.

We can also say with confidence that we will publish again, soon.

Now, some competitor or competitor loyalist may try to make more of this than there is … we all know there is the risk of tabloid frenzy around any adversity in a society with free speech … and we wouldn’t have it any other way.

It is unfortunate that the proceedings are confidential and cannot be shared. I hope you republish your results at 30TB.

Contrary to a long list of pundits, I believe that these benchmarks have an important place in the marketing of a product and its capabilities.

I reiterate what I said in a previous blog post:

ParAccel’s solution is based on high-performance trilithium crystals. (Note: I don’t know why this wasn’t disclosed in the full disclosure report).

I hope the challenge was not about the trilithium crystals and the fact that you didn’t disclose them in the full disclosure report.