Hype Cycles

A farewell to facts, rigor and civility

I read this article from Mike (The Fein Line) Feinstein’s blog and it occurs to me that we have collectively chosen to ignore facts, rigor and civility.

For a whole bunch of reasons (beyond the one that Mike mentions), I have to say that I too am happy and proud that I live in Massachusetts.

OpenID first impressions

I have been meaning to try OpenID for some time now and I just noticed that they were doing a free TFA (what they call VIP Credentials) thing for mobile devices so I decided to give it a shot.

I picked Verisign’s OpenID offering; in the past I had a certificate (document signing) from Verisign and I liked the whole process so I guess that tipped the scales in Verisign’s favor.

The registration was a piece of cake, downloading the credential generator to my phone and linking it to my account was a breeze. They offer a File Vault (2GB) free with every account (Hey Google, did you hear that?) and I gave that a shot.

I created a second OpenID and linked it to the same mobile credential generator (very cool). Then I figured out what to do if my cell phone (and mobile credential generator were to be lost or misplaced), it was all very easy. Seemed too good to be true!

And, it was.

Facebook allows one to use an external ID for authentication. Go to Account Settings and Linked Accounts and you can setup the linkage. Cool, let’s give that a shot!

So much for that. I have an OpenID, anyone have a site I could use it on?

Oh yes! I could login to Verisignlabs with my OpenID 🙂

Update:

I tried to link my existing “Hacker News” (news.ycombinator.com) account with OpenID and after authenticating with verisign, I got to a page that asked me to enter my HN information which I did.

I ended up with a page: http://news.ycombinator.com/openid_merge and a single word “Unknown” on the screen.

I’ve got to be doing something wrong. Someone care to tell me how badly messed up I am?

Update (sept 11)

Thanks to help from Gary (who commented on this post), I tried the “linking” on Facebook again and this time it worked a little better.

But, I still have to enter my password when I want to login to facebook. Something is still not working the way it should.

Still the same issue with Hacker News.

What I learnt from the GMAIL outage

We all have heard about it, many of us (most of us) were affected by it, some of us actually saw it. This makes it a fertile subject for conversation; in person and over a cold pint, or online. I have read at least a dozen blog posts that explain why the GMAIL outage underscores the weakness of, and the reason for imminent failure of cloud computing. I have read at least two who explain why this outage proves the point that enterprises must have their own mail servers. There are graphs showing the number of tweets at various phases of the outage. There are articles about whether GMAIL users can sue Google over this failure.

The best three quotes I have read in the aftermath of the Gmail outage are these:

“So by the end of next May, we should start seeing the first of the Google Outage babies being born.” – Carla Levy, Systems Analyst

“Now I don’t look so silly for never signing up for an e-mail address, do I?” – Eric Newman, Pile-Driver Operator

“Remember the time when 150 million people couldn’t use Gmail for nearly ten years? From 1993–2003? And every year before that? Unimaginable.” – Adam Carmody, Safe Installer

Admittedly, all three came from “The Onion“.

This article is about none of those things. To me, the GMAIL outage could not have come at a better time. I have just finished reconfiguring where my mail goes and how it gets there. The outage gave me a chance to make sure that all the links worked well.

I have a GMAIL account and I have email that comes to other (non-GMAIL) addresses. I use GMAIL as a catcher for the non-GMAIL addresses using the “Imports and Forwarding” capability of GMAIL. That gives me a single web based portal to all of my email. The email is also POP3’ed down to a PC, the one which I am using to write this blog post. I get to read email on my phone (using its POP3 capability) from my GMAIL account. Google is a great backup facility, a nice web interface, and a single place where I can get all of my email. And, if for any reason it were to go kaput, as it did on the 1st, in a pinch, I can get to the stuff in a second or even a third place.

But, more importantly, if GMAIL is unavailable for 100 minutes, who gives a crap. Technology will fail. We try to make it better but it will still fail from time to time. Making a big hoopla about it is just plain dumb. On the other hand, an individual could lose access to his or her GMAIL for a whole bunch of reasons; not just because Google had an outage. Learn to live with it.

So what did I learn from the GMAIL outage? It gave me a good chance to see a bunch of addicts, and how they behave irrationally when they can’t get their “fix”. I’m a borderline addict myself (I do read email on my phone, as though I get things of such profound importance that instant reaction is a matter of life and death). The GMAIL outage showed me what I would become if I did not take some corrective action.

Technology has given us the means to “shrink the planet” and make a tightly interconnected world. With a few keystrokes, I can converse with a person next door, in the next state or half way across the world. Connectivity is making us accessible everywhere; in our homes, workplaces, cars, and now, even in an aircraft. It has given us the ability to inundate ourselves with information, and many of us have been over-indulging (to the point where it has become unhealthy).

Full Corn Moon

Today was the “Full Corn Moon” and I was lucky enough to get a clear night.

You can see a larger image by clicking on the picture above.

The moon is barely visible in the first image, hidden by the trees. In the second and the third, it makes it out of the trees just as the sun is setting behind me.

And here I was complaining about Verizon Wireless!

I thought poorly of Verizon Wireless service and features (though I’ve been a customer for a while).

That all changed when I read this.

Hey Verizon, where do I sign up for that two year contract? But, could you give me a cellular data plan with more than 5GB per month please …

Moonrise!

I have never had too much luck with photographs of the moon. That may just have changed 🙂

Till January 1, 2010, Bye-Bye Linux!

Bye-Bye Ubuntu! Back to Windows …

Around the New Year each year, the fact that I am bored silly leads me to do strange things. For the past couple of years, in addition to drinking a lot of Samuel Adams Double Bock or Black Lager, I kick Windows XP, Vista or whatever Redmond has to offer and install Linux on my laptop.

For two years now, Ubuntu has been the linux of choice. New Year 2009 saw me installing 8.10 (Ignorant Ignoramus) and later upgrading to 9.04 (Jibbering Jackass). But, I write this blog post on my Windows XP (Service Pack 3) powered machine.

Why the change, you ask?

This has arguably been the longest stint with Linux. In the past (2007) it didn’t stay on the PC long enough to make it into work after the New Year holiday. In 2008, it lasted two or three weeks. In 2009, it lasted till the middle of August! Clearly, Linux (and Ubuntu has been a great part of this) has come a very long way towards being a mainstream replacement for Windows.

But, my benchmark for ease of use still remains:

Ease of initial installation
- On Windows, stick a CD in the drive and wait 2 hours
- On Linux, stick a CD in the drive and wait 20 minutes
- Click mouse and enter some basic data along the way
Ease of setup, initial software update, adding basic software that is not part of the default distribution
- On Windows, VMWare (to run linux), Anti-Virus, Adobe things (Acrobat, Flash, …)
- On Linux, VMWare (to run windows), Adobe things
Ease of installing and configuring required additional “stuff”, additional drivers
- printers
- wacom bamboo tablet
- synchronization with PDA (Windows ActiveSync, Linux <sigh>)
- On Windows, DELL drivers for chipset, display, sound card, pointer, …
Configuring Display
- resolution, alignment
Configuring Mouse and Buttons
Making sure that docking station works
- On Windows, DELL has some software to help with this
- On Linux, pull your hair out
Setting Power properties for maximum battery life
- On Windows, what a pain
- On Linux, CPU Performance Applet
Making sure that I login and can work as a non-dangerous user
- On Windows, group = Users
- On Linux, one who can not administer the system, no root privileges
Setup VPN
- On Windows, CISCO VPN Client most often. Install it and watch your PC demonstrate all the blue pixels on the screen
- On Linux, go through the gyrations of downloading Cisco VPN client from three places, reading 14 blogs, web pages and how-to’s on getting the right patches, finding the compilers aren’t on your system, finding that ‘patch’ and system headers are not there either. Finally, realizing that you forgot to save the .pcf file before you blew Windows away so calling IT manager on New Year’s day and wishing him Happy New Year, and oh, by the way, could you send me the .pcf file (Thanks Ed).
Setup Email and other Office Applications
- On Linux, installing a Windows VM with all of the Office suite and Outlook
- On Windows, installing all of the Office suite and Outlook and getting all the service packs
- Install subversion (got to have everything under version control). There’s even a cool command line subversion client for Windows (Slik Subversion 1.6.4)
Migrate Mozilla profile to new platform
- Did you know that you can literally take .mozilla and copy it to someplace in %userprofile% or vice-versa and things just work? Way cool! Try that with Internet Exploder!
Restore SVN dump from old platform

OK, so I liked Linux for the past 8 months. GIMP is wonderful, the Bamboo tablet (almost just works ™), system booted really fast, … I can go on and on.

But, some things that really annoyed me with Linux over the past 8 months

Printing to the Xerox multi function 7335 printer and being able to do color, double sided, stapling etc., The setup is not for the faint hearted
Could I please get the docking station to work?
Could you please make the new Mozilla part of the updates? If not, I have Firefox and Shrill-kokatoo or whatever the new thing is called. What a load of horse-manure that upgrade turned out to be. On Windows, it was a breeze. Really, open-source-brethren, could you chat amongst yourselves?

But the final straw was that I was visiting a friend in Boston and wanted to whip out a presentation and show him what I’d been up to. External display is not an easy thing to do. First you have to change resolutions, then restart X, then crawl through a minefield, sing the national anthem backwards while holding your nose. Throughout this “setup”, you have to be explaining that it is a Linux thing.

Sorry folks, you aren’t ready for mainstream laptop use, yet. But, you’ve made wonderful improvement since 2007. I can’t wait till December 31, 2009 to try this all over again with Ubuntu 9.10 (Kickass Knickerbockers).

OpenDNS again

Content Filtering

Nice, they support content filtering as part of the setup. That could be useful. Right now, I reduce annoyances on my browsing experience with a suitably crafted “hosts” file (Windows Users: %SYSTEMROOT%system32driversetchosts).

127.0.0.1       localhost
127.0.0.1       ad.doubleclick.net
127.0.0.1       hitbox.com
127.0.0.1       ai.hitbox.com
127.0.0.1       googleads.g.doubleclick.net
127.0.0.1       ads.gigaom.com
127.0.0.1       ads.pheedo.com
[... and so on ...]

I guess I can push this “goodness” over to OpenDNS and reduce the pop-up crap that everyone will get at home. (click on image for a higher resolution version of the screen shot)

Multiple Networks!

Very cool! I can setup multiple networks as part of a single user profile. So, my phone and my home router could both end up being protected by my OpenDNS profile.

I wonder how that would work when I’m in a location that hands out a non-routable DHCP address; such as at a workplace. I guess the first person to register the public IP of the workplace will see traffic for everyone in the workplace with a per-PC OpenDNS setting that shares the same public IP address? Unclear, that may be awkward.

Enabling OpenDNS on a Per-PC basis.

In last nights post, I had questioned the rationale of enabling OpenDNS on a per-PC basis. I guess there is some value to this because OpenDNS provides me a way to influence the name resolution process. And, if I were to push content filtering onto OpenDNS, then I would like to get the same content filtering when I was not at home; e.g. at work, at Starbucks, …

I’m sure that over-anxious-parents-who-knew-a-thing-or-two-about-PC’s could load the “Dynamic IP” updater thing on a PC and change the DNS entries to point to OpenDNS before junior went away to college 🙂

So, I guess that per-PC OpenDNS settings may make some sense; it would be nice to have an easy way to enable this when required. I guess that is a fun project to work on one of these days when I’m at Starbucks.

Jeremiah says, “I do it on a per computer basis because I occasionally need to disable it. (Mac OS X makes this super quick with Locations)”. Jeremiah, please do tell why you occasionally need to disable it. Does something fail?

Other uses of OpenDNS

kuzux writes in response to my previous post that OpenDNS can be used to get around restrictive ISP’s. That is interesting because the ISP’s that have put these restrictions in place are likely only blocking name resolution and not connection and traffic. Further, the ISP’s could just as well find the IP addresses of the sites like OpenDNS and put a damper on the festivities. And, one does not have to look far to get the IP addresses of the OpenDNS servers 🙂

Two thoughts come to mind. First, if the authorities (in Turkey as Kuzux said) put the screws on OpenDNS, would they pour out the DNS lookup logs for specific IP addresses that they cared about (both source and destination). Second, a hypothetical country with huge manufacturing operations, a less stellar human rights record, and a huge number of take-out restaurants all over the US (that shall remain nameless), could take a dim view of a foreigner who had OpenDNS on his/her laptop and was able to access “blocked” content.

Other comments

Janitha Karunaratne writes in response to my previous post that, “Lot of times if it’s a locked down limited network, they will intercept all DNS traffic, so using OpenDNS won’t help (their own default DNS server will reply no matter which DNS server you try to reach)”. I guess I don’t understand how that could be. When a machine attempts a DNS lookup, it addresses the packet specifically to the DNS server that it is targeting. Are you suggesting that these “locked down limited networks” will intercept that packet, redirect it to the in-house DNS server and have it respond?

David Ulevitch (aka Founder and CTO of OpenDNS) writes, “Yeah, there are all kinds of reasons people use our service. Speed, safety, security, reliability… I do tests when I travel, and have even done it with GoGo on a VA flight and we consistently outperform”. Mr. Ulevitch, your product is wonderful and easy to use. Very cool. But, I wonder about this performance claim. When I am traveling, potentially sitting in an airport lounge, a hotel room, a coffee shop or in a train using GPRS based internet service with unknown bandwidth, is the DNS lookup a significant part of the response time to a page refresh, mail message download, (insert activity of your choice)?

My Point of View

It seems to work, it can’t hurt to use it at home (if my ISP has a problem with it, they can block traffic to the IP address). It doesn’t seem to be appreciably faster or slower than my ISP’s DNS. I’ll give it a shot for a while and see what the statistics say (when they get around to updating them).

OpenDNS is certainly an easy to use, non-disruptive service and is worth giving a shot. If you use the free version of OpenDNS (ie don’t create an account, just point to their name servers), there is little possible downside; if you get on a Virgin Atlantic flight, you may need to disable it. But, if you use the registered site, just remember that OpenDNS is collecting a treasure trove of information about where you go, what you do, and they have your email address, IP address (hence a pretty good idea of where you live). They already target advertising to you on the default landing page for bad lookups. I’m not suggesting that you will get spammed up the wazoo but just bear in mind that you have yet another place where a wealth of information about you is getting quietly stored away for later use.

But, it is a cool idea. Give it a shot.

OpenDNS and paid wireless services.

Why would anyone ever want to use OpenDNS? I just don’t get it? But, there must be a good reason that I’m totally missing.

I stumbled on this post, not sure how.

http://thegongshow.tumblr.com/post/176629519/virgin-america-inflight-wireless

I didn’t know about OpenDNS but it seems like a strange idea; after all, why would someone not use the DNS server that came with their DHCP lease? Seems like a fairly simple idea, over-ride the DNS servers and force the choice to be to use the DNS servers that OpenDNS provides, but why would anyone care to do this on a PC by PC basis? That seems just totally illogical!

And the problem that the author of the blog (above) mentioned is fairly straightforward. PC comes on the Virgin Wireless network, attempts to go to a web page, sends out a connection request for a page (assume that it is from a cache of IP addresses) and receives a HTML redirect to a named page (asking one to cough up money). The HTML redirect mentions the page by name and that name is not cached and results in a DNS lookup which goes to OpenDNS. Those servers (hardcoded into the network configuration) are not accessible because the user has not yet coughed up said money. Conversely, If the initial lookup was not from a cached IP address then the DNS lookup (which would have been sent to OpenDNS) would have not made it very far (OpenDNS’s server is not reachable till you cough up money). One way or the other, the page asking the user to cough up cache would not have showed up.

So, could one really do this OpenDNS thing behind a pay-for-internet-service? Not unless you can add preferred DNS servers to network configuration without a reboot. (the reboot will get the cycle of DHCP request going again, and on the high volume WiFi access services, DHCP request will automatically expire the previous lease).

But, to the more basic issue, why ever would someone enable this on a PC-by-PC basis? I can totally understand a system administrator using this at an enterprise level; potentially using the OpenDNS server instead of the ones that the ISP provided. Sure beats me! And there sure can’t be enough money in showing advertisements on the redirect page for missing URL’s (or there must be a lot of people who fat-finger URL’s a lot more than I do).

And, the functionality of being able to watch my OpenDNS traffic on a dashboard, I just don’t get it. Definitely something to think more about … They sure seem to be a successful operation so there must clearly be some value to the service they offer, to someone.

When “free” isn’t quite “free”. Or, remember to read the fine print on online photo sharing sites.

A quick comparison of some online photo sharing sites.

I have been a happy user of Flickr (I use the “frugal” account) till yesterday when I got a big pop-up box about a 200 image limit. Apparently flickr does offer unlimited free image storage but the fine print says that only the most recent 200 can be shared. Not the worst thing in the world but I began to think of all the other things I did not like about Flickr and so I started to look around and see whether I had some other alternatives.

There are any number of online image sharing sites. Many of them are also linked with photo printing services, and arguably that is where the money is. It appears that the devil is most certainly hiding in the fine print. Here is a short summary of what I found. I’m planning to move to Shutterfly; do you have some experience with them which makes this a bad idea?

Free Account

Paid Accounts

Flickr

http://www.flickr.com/help/limits/

Free account has unlimited (modest resolution) image
storage. No more than 200 can be shared at a time. There are upload limits on the number and size of pictures that you can upload.

Unlimited upload, storage and high resolution storage.

$24.95 per year

Photobucket

http://photobucket.com/faq?catID=29&catSelected=f&topicID=320

http://photobucket.com/faq?catID=41&catSelected=f&topicID=323

http://photobucket.com/faq?catID=39&catSelected=f&topicID=520

Free account has limited (modest resolution) images.

Periodic logins are required; failure to do so will deactivate
media

Unlimited capacity, FTP uploads (what a concept), no
advertising, personal URL’s

$24.95 per year

Shutterfly

http://www.shutterfly.com

Free unlimited storage of pictures. Images stored at high resolution. But you cannot download at high resolution. Personalized web portal.

I don’t think they even offer a paid account option. My
kind of place!

Snapfish (HP)

References

http://getsatisfaction.com/snapfish/topics/downloading_snapfish_photos

http://www1.snapfish.com/helppricing#hires

Snapfish offers unlimited photo sharing and storage. Customers must be “active”. The bar for an active customer is that you just need to
make one purchase a year.

But read this
link. You can store high resolution images but downloading high
resolution images is not free.

OUCH!

No paid account.

Picasa (Google)

1GB limit (seems odd for the company that claims that
storage is unlimited).

Other limits also apply.

No paid offering that I could find.

Smugmug

No ads! Now, isn’t this a great graphic to illustrate the success in targeted advertising and reinforce their point?

Reference

http://www.smugmug.com/photos/photo-sharing-sites-compared/

No free offering. There is a 14 day free trial.

Standard: $39.95/year

Power: $59.95/year

Pro: $149.95/year

Winkflash

http://www.winkflash.com/

http://www.winkflash.com/content/storage.asp

Free image hosting.

Free unlimited storage.

Free high resolution image downloads.

100% FREE

Too good to be true?

MPIX

http://mpix.com/

Free site for 60 days. After that, need an order to keep
content online.

MPIX is a professional print outfit; online sharing is not
their primary business.

WHCC (White House Custom Color)

http://www.whcc.com/

WHCC is a professional print outfit. They don’t do photo
galleries and sharing stuff. Go here if you want serious prints.

Not offered.

I think I’m heading to shutterfly. On Sept 5th I found this Shutterfly article (answer id 181)

“Currently, we do not have full resolution downloading available. However, we do have an Archive DVD service that you can order which contains full resolution copies of your pictures. The images on an Archive DVD will not include any of the Shutterfly enhancements or rotations that have been applied to an image loaded to Shutterfly.”

What BS is this? I guess I’m going to stick with Flickr for a while longer.

Also read http://forums.dpreview.com/forums/read.asp?forum=1023&message=32634142

Making life interesting

I generally don’t like chain letters and SPAM but once in a while I get a real masterpiece. Below is one that I received yesterday …

Working people frequently ask retired people what they do to make their days interesting. Well, for example, the other day the wife and I went into town and went into a shop. We were only in there for about 5 minutes.

When we came out, there was a cop writing out a parking ticket. We went up to him and I said, “Come on man, how about giving a senior citizen a break?”

He ignored us and continued writing the ticket. I called him a Dumb ass. He glared at me and started writing another ticket for having worn tires. So Mary called him a shit head. He finished the second ticket and put it on the windshield with the first.

Then he started writing a third ticket. This went on for about 20 minutes.

The more we abused him, the more tickets he wrote.

Just then our bus arrived and we got on it and went home.

We try to have a little fun each day now that we`re retired.

It`s important at our age.

Priceless!

I’ve lost my cookies!

Some days ago I read an article (a link is on the Breadcrumbs tab on the right of my blog about Super Cookies). I didn’t think to check my machine for these super cookies and did not pay close attention to the lines that said:

GNU-Linux: ~/.macromedia

Sure enough …

amrith@amrith-laptop:~$ find . -name *.sol 2>/dev/null | wc -l
141
amrith@amrith-laptop:~$

Hmmm …

amrith@amrith-laptop:~$ rm `find . -name *.sol 2>/dev/null `
amrith@amrith-laptop:~$ find . -name *.sol 2>/dev/null | wc -l
0
amrith@amrith-laptop:~$

Much better. And my flash player still works. Let’s go look at some video …

amrith@amrith-laptop:~$ find . -name *.sol 2>/dev/null
./.macromedia/Flash_Player/macromedia.com/support/flashplayer/sys/settings.sol
./.macromedia/Flash_Player/macromedia.com/support/flashplayer/sys/#s.ytimg.com/settings.sol
./.macromedia/Flash_Player/#SharedObjects/CZRS8QS7/s.ytimg.com/soundData.sol
./.macromedia/Flash_Player/#SharedObjects/CZRS8QS7/s.ytimg.com/videostats.sol
amrith@amrith-laptop:~$

Time to fix this sucker …

amrith@amrith-laptop:~$ tail -n 1 .bashrc
rm -f `find ./.macromedia -name *.sol 2>/dev/null`
amrith@amrith-laptop:~$

Boston Cloud Services meetup yesterday

summary of boston cloud services meetup yesterday.

Tsahy Shapsa of aprigo organized the second Boston Cloud Services meetup yesterday. There were two very informative presentations, the first by Riki Fine of EMC on the EMC Atmos project and the second by Craig Halliwell from TwinStrata.

What I learnt was that Atmos was EMC’s entry into the cloud arena. The initial product was a cloud storage offering with some additional functionality over other offerings like Amazon’s. Key product attributes appear to be scalability into the petabytes, policy and object metadata based management, multiple access methods (CIFS/NFS/REST/SOAP), and a common “unified namespace” for the entire managed storage. While the initial offering was for a cloud storage offering, there was a mention of a compute offering in the not too distant future.

In terms of delivery, EMC has setup its own data centers to host some of the Atmos clients. But, they have also partnered with other vendors (AT&T was mentioned) who would provide an cloud storage offerings that exposed the Atmos API. AT&T’s web page reads

AT&T Synaptic Cloud Storage uses the EMC Atmos™ backend to deliver an enterprise-grade global distribution system. The EMC Atmos™ Web Services API is a Web service that allows developers to enable a customized commodity storage system over the Internet, VPNs, or private MPLS connectivity.

I read this as a departure from the approach being taken by the other vendors. I don’t believe that other offerings (Amazon, Azure, …) provide a standardized API and allow others to offer cloud services compliant to that interface. In effect, I see this as an opportunity to create a marketplace for “plug compatible” cloud storage. Assume that a half dozen more vendors begin to offer Atmos based cloud storage, each offering a different location, SLA’s and price point, an end user has the option to pick and choose from that set. To the best of my knowledge, today the best one can do is pick a vendor and then decide where in that vendor’s infrastructure the data would reside.

Atmos also seems to offer some cool resiliency and replication functionality. An application can leverage a collection of Atmos storage providers. Based on policy, an object could be replicated (synchronously or asynchronously) to multiple locations on an Atmos cloud with the options of having some objects only within the firewall and others being replicated outside the firewall.

Enter TwinStrata who are an Atmos partner. They have a cool iSCSI interface to the Atmos cloud storage. With a couple of clicks of a mouse, they demonstrated the creation of a small Atmos based iSCSI block device. Going over to a windows server machine and rescanning disks they found the newly created volume. A couple of clicks later there was a newly minted “T:” that the application could use, just as it would a piece of local storage. TwinStrata provides some additional caching and ease of use features. We saw the “ease of use” part yesterday. The demo lasted a couple of minutes and no more than about a dozen mouse clicks. The version that was demo’ed was the iSCSI interface, there was talk of a file system based interface in the near future.

Right now, all of these offerings are expected to be for Tier-3 storage. Over time, there is a belief that T2 and T1 will also use this kind of infrastructure.

Very cool stuff! If you are in the Boston area and are interested in the Cloud paradigm, definitely check out the next event on Sept 23rd.

Pizza and refreshments were provided by Intuit. If you haven’t noticed, the folks from Intuit are doing a magnificent job fostering these kinds of events all over the Boston Area. I have attended several excellent events that they have sponsored. A great big “Thank You” to them!

Finally, a big “Thank You” to Tsahy and Aprigo for arranging this meetup and offering their premises for the meetings.

More sunsets

The sunset this evening was quite spectacular. The picture (panoramic) was taken from Prospect Hill Road in Harvard, MA.

Click on the image to see a higher resolution picture on flickr. The image here has been cropped to make sure it shows up properly on the blog interface.

Are you getting the whole picture?

Human field of vision, the shortcomings of simple camera, and how to take breathtaking pictures with a simple point-and-shoot camera.

While taking pictures, the field of vision is something that is often overlooked. A normal point and click camera has a field of vision of about 40°x35°. But, the human eye(s) provide you with a field of vision that is almost 200°x130°. Very often, you come upon a sight that is breathtaking and you whip out your camera and shoot some pictures. When you get back home and look at the pictures on a PC monitor, they don’t look quite the same.

To get some idea of what a short focus length lens (wide field of vision) can do for you, take a look at this picture on Ken Rockwell’s web page. The image that I would like you to look at is here. This awesome image is copyrighted by KenRockwell.com. If you are a photo buff, you should bookmark kenrockwell.com and subscribe to the RSS feed. I find it absolutely invaluable.

I don’t have this kind of amazing 13mm lens but a panoramic image using stitching can produce a similar field of view.

Panoramic images are a very cost effective way to get pictures with a very wide field of vision. If you are interested in all the science and technology behind the process of converting multiple segments of an image into a single panoramic image, you can refer to the FAQ at AutoStitch. There is an interesting paper on how all this works that you can read here and there is an informative presentation that goes with that paper.

Panoramic images are also better than short focal length lenses because there is less distortion towards the edges. Notice that the houses at the right and left edge of the first image above appear to be leaning. With panoramic stitching these effects can be eliminated.

Some quick tips if you plan to take a panoramic picture.

Set the camera to manual exposure mode to reduce the corrections that need to be done in software.
Use a tripod and make sure that you get a complete coverage of the area that you want to stitch.
Make sure that you overlap images by about a third. I usually turn on the visible grid in the view finder to help with this.
Take lots of pictures, there is nothing to beat practice.

In my previous post some panoramic sunsets were shown. I took several sets of pictures, such as the five below. These were then stitched together using a software called Autostitch. You can get a copy of autostitch at http://www.autostitch.net/

Enjoy!

Sunset in Rye, New Hampshire

I got two interesting pictures of the sunset in Rye, NH today. The two pictures were each made by tiling 5 images using autostitch.

The two pictures are five minutes apart and the colors changed quite interestingly in that 5 minute interval.

And about five minutes later

I have reduced the size on these images to 70% so that they show up ok on the blog page. Larger images are on flickr (use the left panel).

Look Ma! NoSQL!

More musings on NoSQL and a blog I read “NoSQL: If Only It Was That Easy”

There has definitely been more chatter about NoSQL in the Boston area lately. I hear there is a group forming around NoSQL ( I will post more details when I get them ). There were some NoSQL folks at the recent Cloud Camp which I was not able to attend (damn!).

My views on NoSQL are unchanged from an earlier post on the subject. I think there are some genuine issues about database scaling that are being addressed through a variety of avenues (packages, tools, …). But, in the end, the reason that SQL has survived for so long is because it is a descriptive language that is reasonably portable. That is also the reason why, in the data warehousing space, you have each vendor going off and doing some non-SQL extension in a totally non-portable way. And they are all going to, IMHO, have to reconcile their differences before the features get wide mainstream adoption.

This morning I read a well researched blog post by BJ Clark by way of Hacker News. (If you don’t use HN, you should definitely give it a try).

I strongly recommend that if you are interested in NoSQL, you read the conclusion section carefully. I have annotated the conclusion section below.

“NoSQL is a great tool, but it’s certainly not going to be your competitive edge, it’s not going to make your app hot, and most of all, your users won’t give a shit about any of this.

What am I going to build my next app on? Probably Postgres.

Will I use NoSQL? Maybe. [I would not, but that may just be my bias]

…

I might keep everything in flat files. [Yikes! If I had to do this, I’d consider something like MySQL CSV first]

…

If I need reporting, I won’t be using any NoSQL.

…

If I need ACIDity, I won’t use NoSQL.

…

If I need transactions, I’ll use Postgres.

…”

NoSQL is a great stepping stone, what comes next will be really exciting technology. If what we need is a database that scales, let’s go make ourselves a database that scales. Base it on MySQL, PostgreSQL, … but please make it SQL based. Extend SQL if you have to. I really do like to be able to coexist with the rich ecosystem of visualization tools, reporting tools, dashboards, … you get the picture.

On over-engineered solutions

The human desire to over-engineer solutions

On a recent trip, I had way too much time on my hands and was ambling around the usual overpriced shops in the airport terminal. There I saw a 40th anniversary “special edition” of the Space Pen. For $799.99 (plus taxes) you could get one of these pens, and some other memorabilia to commemorate the 40th anniversary of the first lunar landing. If you aren’t familiar with the Space Pen, you can look at learn more at the web site of the Fisher Space Pen Co.

For $800 you could get AG7-40LE – Astronaut Space Pen 40th Year Moon Landing Celebration Commemorative Pen & Box.

Part of this pen actually circled the moon!

Salient features of this revolutionary device:

it writes upside down
it writes in any language (so the Russians bought some of these :))
it draws pictures
it writes in zero gravity
it writes under water
it writes on greasy surfaces
it fixes broken switches on lunar modules

If you want to know about the history of this device, you can also look at NASA’s web page here. On the web page of the Fisher Space Pen Co, you can also see the promotion of the AG7-40 – Astronaut Space Pen 40th Year Moon Landing Celebration Engraving.

While in the airport, I saw a couple of security officers rolling around on the Segway Personal Transporter. Did you know that for approximately $10,000 you could get yourself a Segway PT i2 Ferrari Limited Edition? I have no idea how much the airport paid for them but the Segway i2 is available on Amazon for about $6,000. It did strike me as silly, till I noticed three officers (a lot more athletic) riding through the terminal on Trek bicycles. That seemed a lot more reasonable. I have a bicycle like that, and it costs maybe 10% of a Segway.

I got thinking about the rampant over-engineering that was all around me, and happened upon this web page, when I did a search for Segway!

How to render the Segway Human Transporter Obsolete by Maddox.

Who would have thought of this, just add a third wheel and you could have a vehicle just as revolutionary? I thought about it some more and figured that the third wheel in Maddox’s picture is probably not the best choice; maybe it should be a wheel more like the track-ball of a mouse. That would have no resistance to turning and the contraption could do an effortless zero radius turn if required. The ball could be spring loaded and the whole thing could be on an arm that had some form of shock absorbing mechanism.

And, if we were to have done that, we would not have seen this. Bush Fails Segway Test! As an interesting aside, did you know that the high tech heart stent used for people who have bypass surgery was also invented by Dean Kamen?

We have a million ways to solve most problems. Why then do we over-engineer solutions to the point where the solution introduces more problems?

Keep it simple! There are less things that can break later.

Oh, and about that broken switch in the lunar module. Buzz Aldrin has killed that wonderful urban legend by saying he used a felt tip pen.

Buzz Aldrin still has the felt-tipped pen he used as a makeshift switch needed to fire up the engines that lifted him and fellow Apollo 11 astronaut Neil Armstrong off the moon and started their safe return to Earth nearly 40 years ago.

“The pen and the circuit breaker (switch) are in my possession because we do get a few memorabilia to kind of symbolize things that happened,” Aldrin told reporters Friday.

If Buzz Aldrin used a felt tipped pen, why do we need a Space Pen? And what exactly are we celebrating by buying a $800 pen that can’t even fix a broken switch. A pencil can write upside down (and also write in any language :)). Why do we need Segway Human Transporters in an airport when most security officers should be able to walk or ride a bicycle. Why do we build complex software products and make them bloated, unusable, incomprehensible and expensive?

That’s simple, we’re paying a homage to our overpowering desire to over-engineer solutions.

How to render the Segway Human Transporter obsolet

Xtremedata’s latest announcement

A short summary of the dbX announcement by xtremedata and ISA’s in data warehouse appliances

Xtremedata has announced their entry into the data warehouse appliance space. Called the dbX, this product is an interesting twist on the use of FPGA’s (Field Programmable Gate Arrays). Unlike the other data warehouse appliance companies, Xtremedata is not a data warehouse appliance startup. They are a six year old company with an established product line and a patent for the In Socket Accelerator that is part of dbX.

What is ISA?

The core technology that Xtremedata makes is a device called an In Socket Accelerator. Occupying a CPU or co-processor slot on a motherboard, this plug compatible device emulates a CPU using an FPGA. But, the ISA goes beyond the functionality provided by the CPU that it replaces.

Xtremedata’s patent (US Patent 7454550 – Systems and methods for providing co-processors to computing systems. issued in 2008) includes the following description

It is sometimes desirable to provide additional functionality/capability or performance to a computer system and thus, motherboards are provided with means for receiving additional devices, typically by way of “expansion slots.” Devices added tothe motherboard by these expansion slots communicate via a standard bus. The expansion slots and bus are designed to receive and provide data transport of a wide array of devices, but have well-known design limitations.

One type of enhancement of a computer system involves the addition of a co-processor. The challenges of using a co-processor with an existing computer system include the provision of physical space to add the co-processor, providing power to the co-processor, providing memory for the co-processor, dissipating the additional heat generated by the co-processor and providing a high-speed pipe for information to and from the co-processor.

Without replacing the socket, which would require replacing the motherboard, the CPU cannot be presently changed to one for which the socket was not designed, which might be desirable in providing features, functionality, performance and capabilities for which the system was not originally designed.

and

FPGA accelerators and the like, including counterparts therefore, e.g., application-specific integrated circuits (ASIC), are well known in the high-performance computing field. Nallatech, (see http://www.nallatech.com) is an example of several vendors that offer FPGA accelerator boards that can be plugged into standard computer systems. These boards are built to conform to industry standard I/O (Input/Output) interfaces for plug-in boards. Examples of such industry standards include: PCI (PeripheralComponent Interconnect) and its derivatives such as PCI-X and PCI-Express. Some computer system vendors, for example, Cray, Inc., (see http://www.cray.com) offer built-in FPGA co-processors interfaced via proprietary interfaces to the CPU.

What else does Xtremedata do?

Their core product is the In Socket Accelerator with applications in Financial Services, Medical, Military, Broadcast, High Performance Computing and Wireless (source: company web page). The dbX is their latest venture integrating their ISA in a standard HP server and providing “SQL on a chip”.

How does this compare with Netezza?

The currently shipping systems from Netezza also include an FPGA but, as described in their patent (US patent 709231000, Field oriented pipeline architecture for a programmable data streaming processor) places an FPGA between a disk drive and a processor (as a programmable data streaming processor). In the interest of full disclosure, I must also point out that I worked at Netezza and on the product described here. All descriptions in this blog are strictly based on publicly available information and references are provided.

Xtremedata, on the other hand, uses the FPGA in an ISA and emulates the processor.

So, the two are very different architectures, and accomplish very different things.

Other fun things that can be done with this technology 🙂

Similar technology was used to make a PDP-10 processor emulator. You can read more about that project here. Using similar technology and a Xilinx FPGA, the folks described in this project were able to make a completely functional PDP-10 processor.

Or, if you want a new Commodore 64 processor, you can read more about that project here.

Why are these links relevant? ISA’s and processor emulators seem like black magic. After all, Intel and AMD spend tons of money and their engineers spend a huge amount of time and resource to design and build CPU’s. It may seem outlandish to claim that one can reimplement a CPU using some device that one analyst called “Field Programmable Gatorade”. These two projects give you an idea of what people do with this Gatorade thing to make a processor.

And if you think I’m making up the part about Gatorade, look here and here (search for Gatorade, it is a long article).

About Xtremedata

XtremeData, Inc. creates hardware accelerated Database Analytics Appliances and is the inventor and leader in FPGA-based In-Socket Accelerators(ISA). The company offers many different appliances and FPGA-based ISA solutions for markets such as Decision Support Systems, Financial Analytics, Video Transcoding, Life Sciences, Military, and Wireless. Founded in 2003 XtremeData has established two centers: Headquarters are located in Schaumburg, IL (near Chicago, Illinois, USA) and a software development location in Bangalore, India.

XtremeData, Inc. is a privately held company.

source: company web page

My Opinion

This is a technology that I have been watching for some time now. It will be interesting to see how this technique compares (price, performance, completeness) with other vendors who are already in the field. By adopting industry standard hardware (HP) and being a certified vendor for HP (Qualified by HP’s accelerator program), and because this isn’t the only thing they do with these accelerators, it promises to be an interesting development in the market.

Wondering about “Shared Nothing”

Is “Shared Nothing” the best architecture for a database? Is Stonebraker’s assertion of 1986 still valid? Were Norman et al., correct in 1996 when they made a case for a hybrid system? Have recent developments changed the dynamics?

Over the past several months, I have been wondering about the Shared Nothing architecture that seems to be very common in parallel and massively parallel systems (specifically for databases) these days. With the advent of technologies like cheap servers and fast interconnects, there has been a considerable amount of literature that point to an an apparent “consensus” that Shared Nothing is the preferred architecture for parallel databases. One such example is Stonebraker’s 1986 paper (Michael Stonebraker. The case for shared nothing. Database Engineering, 9:4–9, 1986). A reference to this apparent “consensus” can be found in a paper by Dewitt and Gray (Dewitt, Gray Parallel database systems: the future of high performance database systems 1992).

A consensus on parallel and distributed database system architecture has emerged. This architecture is based on a shared-nothing hardware design [STON86] in which processors communicate with one another only by sending messages via an interconnection network.

But two decades or so later, is this still the case, is “Shared Nothing” really the way to go? Are there other, better options that one should consider?

As long ago as 1996, Norman, Zurek and Thanisch (Norman, Zurek and Thanisch. Much ado about shared nothing 1996) made a compelling argument for hybrid systems. But even that was over a decade ago. And the last decade has seen some rather interesting changes. Is the argument proposed in 1996 by Norman et al., still valid? (Updated 2009-08-02: A related article in Computergram in 1995 can be read at http://www.cbronline.com/news/shared_nothing_parallel_database_architecture)

With the advent of clouds and virtualization, doesn’t one need to seriously consider the shared nothing architecture again? Is a shared disk architecture in a cloud even a worthwhile conversation to have? I was reading the other day about shared nothing cluster storage. What? That seems like an contradiction, doesn’t it?

Some interesting questions to ponder:

In the current state of technology, are the characterizations “Shared Everything”, “Shared Memory”, “Shared Disk” and “Shared Nothing” still sufficient? Do we need additional categories?
Is Shared Nothing the best way to go (As advocated by Stonebraker in 1986) or is a hybrid system the best way to go (as advocated by Norman et al., in 1996) for a high performance database?
What one or two significant technology changes could cause a significant impact to the proposed solution?

I’ve been pondering this question for some time now and can’t quite seem to decide which way to lean. But, I’m convinced that the answer to #1 is that we need additional categories based on advances in virtualization. But, I am uncertain about the choice between a Hybrid System and a Shared Nothing system. The inevitable result of advancements in virtualization and clouds and such technologies seem to indicate that Shared Nothing is the way to go. But, Norman and others make a compelling case.

What do you think?

Michael Stonebraker. The case for shared nothing. Database Engineering, 9:4–9, 1986

Wow, the TPC-H speculation continues!

The real technology behind the ParAccel TPC-H results are revealed here!

It is definitely interesting to see that the ParAccel TPC-H result saga is not yet done. Posts by Daniel Abadi (interesting analysis but it seems simplistic at first blush) and reflections on Curt Monash’s blog are proving to be amusing, to say the least.

At the end of the day, whether ParAccel is a “true column store” (or a “vertically partitioned row store”), and whether it has too much disk capacity and disk bandwidth strike me as somewhat academic arguments that don’t recognize some basic facts.

ParAccel’s system (bloated, oversized, undersized, truly columnar, OR NOT) is only the second system to post 30TB results.
That system costs less than half of the price of the other set of published results. Since I said “basic facts”, I will point out that the other entry may provide higher resiliency and that may be a part of the higher price. I have not researched whether the additional price is justified, related to the higher resiliency, etc., I want to make the point that there is a difference in the stated resiliency of the two solutions that is not apparent from the performance claims.

It might just be my sheltered upbringing but:

I know of few customers who purchase cars solely for the manufacturers advertised MPG and similarly, I know of no one who purchases a data-warehouse from a specific vendor because of the published TPC-H results.
I know of few customers who will choose to make a purchase decision based on whether a car has a inline engine, a V-engine or a rotary engine and similarly, I know of no one who makes a buying decision on a data warehouse based on whether the technology is “truly columnar”, “vertically partitioned row store”, “row store with all indexes” or some other esoteric collection of interesting sounding words?

So, I ask you, why are so many people so stewed about ParAccel’s TPC-H numbers? In a few more days, will we have more posts about ParAccel’s TPC-H numbers than we will about whether Michael Jackson really died or whether this is all part of a “comeback tour”?

I can’t answer either of the questions above for sure but what I do know is this, ParAccel is getting some great publicity out of this whole thing.

And, I have it on good authority (it came to me in a dream last night) that ParAccel’s solution is based on high-performance trilithium crystals. (Note: I don’t know why this wasn’t disclosed in the full disclosure report). I hear that they chose 43 nodes because someone misremembered the “universal answer” from The Hitchhikers Guide to the Galaxy. By the time someone realized this, it was too late because the data load had begun. Remember you read it here first 🙂

Give it a rest folks!

P.S. Within minutes of posting this a well known heckler sent me email with the following explanation that confirms my hypothesis.

When a beam of matter and antimatter collide in dilithium we get a plasma field that powers warp drives within the “Sun” workstations. The warp drives that ParAccel uses are the Q-Warp variant which allow queries to run faster than the speed of light. A patent has been filed for this technique, don’t mention it in your blog please.

Not so fast, maybe relational databases aren’t dead!

Maybe the obituary announcing the demise of the relational database was premature!

Much has been written recently about the demise (or in some cases, the impending demise) of the relational database. “Relational databases are dead” writes Savio Rodrigues on July 2nd, I guess I missed the announcement and the funeral in the flood of emails and twitters about another high profile demise.

Some days ago, Michael Stonebraker authored an article with the title, “The End of a DBMS Era (Might be Upon Us)”. In September 2007 he made a similar argument in this article, and also in this 2005 paper with Uğur Çetintemel.

What Michael says here is absolutely true. And, in reality, Savio’s article just has a catchy title (and it worked). The body of the article makes a valid argument that there are some situations where the current “one size fits all” relational database offering that was born in the OLTP days may not be adequate for all data management problems.

So, let’s be perfectly clear about this; the issue isn’t that relational databases are dead. It is that a variety of use use cases are pushing the current relational database offerings to their limits.

I must emphasize that I consider relational databases (RDBMS’s) to be those systems that use a relational model (a definition consistent with http://en.wikipedia.org/wiki/Relational_database). As a result, columnar (or vertical) representations, row (or horizontal) representations, systems with hardware acceleration (FPGA’s, …) are all relational databases. There is arguably some confusion in terminology in the rest of this post, especially where I quote others who tend to use the term “Relational Database” more narrowly, so as to create a perception of differentiation between their product (columnar, analytic, …) and the conventional row oriented database which they refer to as an RDBMS.

Tony Bain begins his three part series about the problem with relational databases with an introduction where he says

“The specialist solutions have be slowly cropping up over the last 5 years and now today it wouldn’t be that unusual for an organization to choose a specialist data analytics database platform (such as those offered from Netezza, Greenplum, Vertica, Aster Data or Kickfire) over a generic database platform offered by IBM, Microsoft, Oracle or Sun for housing data for high end analytics.”

While I have some issues with his characterization of “specialist analytic database platforms” as something other than a Relational Database, I assume that he is using the term RDBMS to refer to the commonly available (general purpose) databases that are most often seen in OLTP environments.

I believe that whether you refer to a column oriented architecture (with or without compression), an architecture that uses hardware acceleration (Kickfire, Netezza, …) or a materialized view, you are attempting to address the same underlying issue; I/O is costly and performance is significantly improved when you reduce the I/O cost. Columnar representations significantly reduce I/O cost by not performing DMA on unnecessary columns of data. FPGA’s in Netezza serve a similar purpose; (among other things) they perform projections thereby reducing the amount of data that is DMA’ed. A materialized view with only the required columns (narrow table, thin table) serves the same purpose. In a similar manner (but for different reasons), indexes improve performance by quickly identifying the tuples that need to be DMA’ed.

Notice that all of these solutions fundamentally address one aspect of the problem; how to reduce the cost of I/O. The challenges that are facing databases these days are somewhat different. In addition to huge amounts of data that are being amassed (The Richard Winter article on the subject) there is a much broader variety of things that are being demanded of the repository of that information. For example, there is the “Search” model that has been discussed in a variety of contexts (web, peptide/nucleotide), the stream processing and data warehousing cases that have also received a fair amount of discussion.

Unlike the problem of I/O cost, many of these problems reflect issues with the fundamental structure and semantics of the SQL language. Some of these issues can be addressed with language extensions, User Defined Functions, MapReduce extensions and the like. But none of these address the underlying issue that the language and semantics were defined for a class of problems that we today come to classify as the “OLTP use case”.

Relational databases are not dead; on the contrary with the huge amounts of information that are being handled, they are more alive than ever before. The SQL language is not dead but it is in need of some improvements. That’s not something new; we’ve seen those in ’92, ’99, … But, more importantly the reason why the Relational Database and SQL have survived this long is because it is widely used and portable. By being an extensible and descriptive language, it has managed to adapt to many of the new requirements that were placed on it.

And if the current problems are significant, two more problems are just around the problem and waiting to rear their ugly heads. The first is the widespread adoption of the virtualization and the abstraction of computing resources. In addition to making it much hardware to adopt solutions with custom hardware (that cannot be virtualized), it introduces a level of unpredictability in I/O bandwidth, latency and performance. Right along with this, users are going to want the database to live on the cloud. With that will come all the requirements of scalability, ease of use and deployment that one associates with a cloud based offering (not just the deployment model). The second is the fact that users will expect one “solution” to meet a wide variety of demands including the current OLTP and reporting through the real time alerting that today’s “Google/Facebook/Twitter Generation” has come to demand (look-ma-no-silos).

These problems are going to drive a round of innovation, and the NoSQL trend is a good and healthy trend. In the same description of all the NoSQL and analytics alternatives, one should also mention the various vendors who are working on CEP solutions. As a result of all of these efforts, Relational Databases as we know them today (general purpose OLTP optimized, small data volume systems) will evolve into systems capable of managing huge volumes of data in a distributed/cloud/virtualized environment and capable of meeting a broad variety of consumer demands.

The current architectures that we know of (shared disk, shared nothing, shared memory) will need to be reconsidered in a virtualized environment. The architectures of our current databases will also need some changes to address the wide variety of consumer demands. Current optimization techniques will need to be adapted and the underlying data representations will have to change. But, in the end, I believe that the thing that will decide the success or failure of a technology in this area is the extent of compatibility and integration with the existing SQL language. If the system has a whole new set of semantics and is fundamentally incompatible with SQL I believe that adoption will slow. A system that extends SQL and meets these new requirements will do much better.

Relational Databases aren’t dead; the model of “one-size-fits-all” is certainly on shaky ground! There is a convergence between the virtualization/cloud paradigms, the cost and convenience advantages of managing large infrastructures in that model and the business need for large databases.

Fasten your seat-belts because the ride will be rough. But, it is a great time to be in the big-data-management field!

Highland Light, Provicetown, MA

If you have ever been to the Highland Light Lighthouse at Provincetown, you will likely recognize the picture below. The image is panoramic and the high resolution image is 15MB and about 13k pixels wide. The image appears cropped here, you can see the complete image at flickr (link on the left).

The image is a composite of 10 discrete images that were stitched using AutoStitch. The software is available for free evaluation at this site.

For more information about Highland Light you can visit their web page at http://www.lighthouse.cc/highland/

Finally, it stopped raining. Not quite!

New England has been having a little bit of rain lately. A little too much actually. Today, the sun finally made an appearance but it didn’t stop raining … So we got quite a nice rainbow!

rainbow

IE+Microsoft ROCKS. Ubuntu+Firefox BLOWS

I’m running Ubuntu 9.0.4 and it appears that to upgrade to Firefox 3.5 requires a PhD. Could I borrow yours?

Kevin Purdy has the following “One line install”

That’s not a true install; he just unpacks the tar-ball into the current directory.

Web Upd8 has the following steps

But, now I get daily updates?

What’s with this garbage? Complain as much as you want but Microsoft Software Updates will give you the latest Internet Explorer easily.

Could someone please make this easy?

It’s just numbers people!

I hadn’t read Justin Swanhart’s post (Why is everybody so steamed about a benchmark anyway?) from June 24th till just a short while ago. You can see it at here. Justin is right, it’s just numbers.

He did remind me of a statement from a short while ago by another well known philosopher.

It's a budget

I could not embed the script provided by quotesdaddy.com into this post. I have used their material in the past, they have a great selection of statements that will live on in history. Check them out at http://www.quotesdaddy.com/ 🙂

Head for the hills, here comes more jargon: Anti-Databases and NoSQL

We all know what SQL is, get ready to learn about NoSQL. There was a meet-up last month in San Francisco to discuss it and some details are available at this link. The call for the “inaugural get-together of the burgeoning NoSQL community” drew 150 people and if you are so inclined, you can subscribe to the email list here.

I’m no expert at NoSQL but the major gripe seems to be schema restriction and what Jon Travis calls “twisting your object” data to fit an RDBMS.

NoSQL appears to be a newly coined collective pronoun for a lot of technologies you have heard about before. They include Hadoop, BigTable, HBase, Hypertable, and CouchDB, and maybe some that you have not heard of before like Voldemort, Cassandra, and MongoDb.

But, then there is this thing called NoSQL. The first line on that web page reads

NoSQL is a fast, portable, relational database management system without arbitrary limits, (other than memory and processor speed) that runs under, and interacts with, the UNIX ¹ Operating System.

Now, I’m confused. Is this the NoSQL that was referenced in the non-conference at non-san-francisco last non-month?

My Point of View

From what I’ve seen so far, NoSQL seems to be a case of disruption at the “low end”. By “low end”, I refer to circumstances where the data model is simple and in those cases I can see the point that SQL imposes a very severe overhead. The same can be said in the case of a model that is relatively simple but refers to “objects” that don’t relate to a relational model (like a document). But, while this appears to be “low end” disruption right now, there is a real reason to believe that over time, NoSQL will grow into more complex models.

But, the big benefit of SQL is that it is declarative programming language that gave implementations the flexibility of implementing the program flow in a manner that matched the physical representation of the data. Witness the fact that the same basic SQL language has produced a myriad of representations and architectures ranging from shared nothing to shared disk, SMP to MPP, row oriented to column oriented, with cost based and rules based optimizers etc., etc.,

There is (thus far) a concern about the relative complexity of the NoSQL offerings when compared with the currently available SQL offerings. I’m sure that folks are giving this considerable attention and what comes next may be a very interesting piece of technology.

More on TPC-H comparisons

Three charts showing comparisons of TPC-H benchmark data.

Just a quick post to upload three charts that help visualize the numbers that Curt and I have been referring to in our posts. Curt’s original post was, my post was.

The first chart shows the disk to data ratio that was mentioned. Note that the X-Axis showing TPC-H scale factor is a logarithmic scale.The benchmark information shows that the ParAccel solution has in excess of 900TB of storage for the 30TB benchmark, the ratio is therefore in excess of 30:1.

The second chart shows the memory to data ratio. Note that both the X and Y Axis are logarithmic scales. The benchmark information shows that the ParAccel solution has 43 nodes and approximately 2.7TB of RAM, the ratio is therefore approximately 1:11 (or 9%).

The third chart shows the load time (in hours) for various recorded results. The ParAccel results indicate a load time of 3.48 hours. Note again that the X-Axis is a logarithmic scale.

For easy reading, I have labeled the ParAccel 30TB value on the chart. I have to admit, I don’t understand Curt’s point. And maybe others share this bewilderment? I think I’ve captured the numbers correctly, could someone help verify these please.

If the images above are shown as thumbnails, you may not be able to see the point I’m trying to make. You need to see the bigger images to see the pattern.

Revision

In response to an email, I looked at the data again and got the following RANK() information. Of the 151 results available today, the ParAccel 30TB numbers are 58th in Memory to Data and 115th in Disk to Data. It is meaningless to compare load time ranks without factoring in the scale and I’m not going to bother with that as the sample size at SF=30,000 is too small.

If you are willing to volunteer some of your time to review the spreadsheet with all of this information, I am happy to send you a copy. Just ask!

Learning from Joost

A lot of interesting articles have emerged in the wake of the recent happenings at Joost.

One post that caught my attention and deserves reading, and re-reading, and then reading one more time just to be sure is Ed Sim’s post, where he writes

raising too much money can be a curse and not a blessing

A lot to learn from in all of these articles.

Risks of mind controlled vehicles

Some risks of mind controlled vehicles are described.

It is now official, Toyota has announced that they are developing a mind controlled wheel chair. Unless Toyota is planning to take the world by storm by putting large numbers of motorized wheelchairs on the roads, I suspect this technology will soon make its way into their other products, maybe cars?

Of course, this will have some risks

Cars may inherit drivers personality (ouch, there are some people who shouldn’t be allowed to drive)
Cars with hungry drivers will make unpredictable turns into drive through lanes
Car could inadvertently eject annoying back seat drivers
Mortality rates for cheating husbands may go up (see http://www.thebostonchannel.com/news/1965115/detail.html, http://www.click2houston.com/news/1576669/detail.html, http://www.freerepublic.com/focus/news/840828/posts and a thousand other such posts)
It would be really hard to administer a driving test! Would the ability to think be a requirement for drivers?

Can you think of other risks? I didn’t think I would come to this conclusion but I kind of like the way things are right now where my car has a mind of its own.

Is TPC-H really a blight upon the industry?

A recap of some posts (Curt Monash, Merv Adrian) about the ParAccel TPC-H 30 TB benchmark numbers.

On June 22, Curt Monash posted an interesting entry on his blog about TPC-H in the wake of an announcement by ParAccel. On the same day, Merv Adrian posted another take on the same subject on his blog.

Let me begin with a couple of disclaimers.

First, I am currently employed by Dataupia, I used to be employed at Netezza in the past. I am not affiliated with ParAccel in any way, nor Sun, nor the TPC committee, nor the Toyota Motor Corporation, the EPA, nor any other entity related in any way with this discussion. And if you are curious about my affiliations with any other body, just ask.

Second, this blog is my own and does not represent or intend to represent the point of view of my employer, former employer(s), or any other person or entity than myself. Any resemblance to the opinions or points of view of anyone other than myself are entirely coincidental.

As with any other benchmark, TPC-H only serves to illustrate how well or poorly a system was able to process a specified workload. If you happen to run a data warehouse that tracks parts, orders, suppliers, and lineitems in orders in 25 countries and 5 nations that resemble the TPC-H specification, your data warehouse may look something like the one specified in the benchmark specification. And if your business problems are similar to the twenty something queries that are presented in the specification, you can leverage hundreds of person-hours of free tuning advice given to you by the makers of most major databases and hardware.

In that regard, I feel that excellent performance on a published TPC-H benchmark does not guarantee that the same configuration would work well in my data warehouse environment.

But, if I understand correctly, the crux of the argument that Curt makes is that the benchmark configurations are bloated (and he cites the following examples)

43 nodes to run the benchmark at SF 30,000
each node has 64 GB of RAM (total of over 2.5TB of RAM)
each node has 24 TB of disk (total of over 900TB of disk)

which leads to a “RAM:DATA ratio” of approximately 1:11 and a “DISK:DATA ratio” of approximately 32:1.

Let’s look at the DISK:DATA ratio first

What no one seems to have pointed out (and I apologize if I didn’t catch it in the ocean of responses) is that this 32:1 DISK:DATA ratio is the ratio between total disk capacity and data and therefore includes overheads.

First, whether it is in a benchmark context or a real life situation, one expects data protection in one form or another. The benchmark report indicates that the systems used RAID 0 and RAID 1 for various components. So, at the very least, the number should be approximately 16:1. In addition, the same disk space is also used for the Operating System, Operating System Swap as well as temporary table space. Therefore, I don’t know whether it is reasonable to assume that even with good compression, a system would acheive a 1:1 ratio between data and disk space but I would like to know more about this.

“By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.”

Leaving the issue of DISK:DATA ratio aside, one thing that most performance tuning looks at is the number of “spindles”. And, having a large number of spindles is a good thing for performance whether it is in a benchmark or in real life. Given current disk drive prices, it is reasonable to assume that a pre-configured server comes with 500GB drives, as is the case with the Sun system that was used in the ParAccel benchmark. If I were to purchase a server today, I would expect either 500GB drives or 1TB drives. If it were necessary to have a lower DISK:DATA ratio and reducing that ratio had some value in real life, maybe the benchmark could have been conducted with smaller disk drives.

Reading section 5.2 of the Full Disclosure Report it is clear that the benchmark did not use all 900 or so Terabytes of data. If I understand the math in that section correctly, the benchmark is using the equivalent of 24 drives and 50GB per drive on each node for data. That is a total of approximately 52TB of storage set aside for the database data. That’s pretty respectable! Richard Gostanian in his post to Curt’s blog (June 24th, 2009 7:34 am) indicates that they only needed about 20TB of data. I can’t reconcile the math but we’re at least in the right ball-park.

And as for the RAM:DATA ratio, the ratio is 1:11. I find it hard to understand how the benchmark could have run entirely from RAM as conjectured by Curt.

“And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well.”

From my experience in sizing systems, one looks at more things than just the physical disk capacity. One should also consider things like concurrency, query complexity, and expected response times. I’ve been analyzing TPC-H numbers (for an unrelated exercise) and I will post some more information from that analysis over the next couple of weeks.

On the whole, I think TPC-H performance numbers (QPH, $/QPH) are as predictive of system performance in a specific data warehouse implementation as the EPA ratings on cars are of actual mileage that one may see in practice. If available, they may serve as one factor that a buyer could consider in a buying decision. In addition to reviewing the mileage information for a car, I’ll also take a test drive, speak to someone who drives the same car, and if possible rent the same make and model for a weekend to make sure I like it. I wouldn’t rely on just the EPA ratings so why should one assume that a person purchasing a data warehouse would rely solely on TPC-H performance numbers?

As an aside, does anyone want to buy a 2000 Toyota Sienna Mini Van? It is white in color and gave 22.4 mpg over the last 2000 or so miles.

No cloud in sight!

The conventional wisdom at the beginning of ’09 was that the economic downturn would catapult cloud adoption but that hasn’t quite happened. This post explores trends and possible reasons for the slow adoption as well as what the future may hold.

A lot has been written in the past few days about Cloud Computing adoption based on a survey by ITIC (http://www.itic-corp.com/). At the time of this writing, I haven’t been able to locate a copy of this report or a link with more details online but most articles referencing this survey quote Laura DiDio as saying,

“An overwhelming 85% majority of corporate customers will not implement a private or public cloud computing infrastructure in 2009 because of fears that cloud providers may not be able to adequately secure sensitive corporate data”.

In another part of the country, structure09 had a lot of discussion about Cloud Computing. Moderating a panel of VC’s, Paul Kedrosky asked for a show of hands of VC’s who run their business on the cloud. To quote Liz Gannes,

“Let’s just say the hands did not go flying up”.

Elsewhere, a GigaOM report by George Gilbert and Juergen Urbanski conclude that leading storage vendors are planning their innovation around a three year time frame, expecting adoption of new storage technologies to coincide with emergence from the current recession.

My point of view

In the short term, services that are already “networked” will begin to migrate into the cloud. The migration may begin at the individual and SMB end of the market rather than at the Fortune 100. Email and CRM applications will be the poster-children for this wave.

PMCrunch also lists some SMB ERP solutions that will be in this early wave of migration.

But, this wave will primarily target the provision of application services through a different delivery model (application hosted on a remote server instead of a corporate server).

It will be a while before cloud based office applications (word-processing, spreadsheets, presentations) become mainstream. The issue is not so much security as it is network connectivity. The cloud is useless to a person who is not on the network and until ubiquitous high bandwidth network connectivity is available everywhere, and at an accessible and reasonable cost, the cloud platform will not be able to move forward.

We are beginning to see increased adoption in Broadband WiFi or Cellular Data in the US but the costs are still too high and service is still insufficient. Just ask anyone who has tried to get online at one of the many airports and hotels in the US.

Gartner highlights five key attributes of Cloud Computing.

Uses Internet Technologies
Service Based
Metered by Use
Shared
Scalable and Elastic

Note that I have re-ordered them into what I believe is the order in which cloud adoption will progress. The early adoption will be in applications that “Uses Internet Technologies” and “Service Based” and the last will be “Scalable and Elastic”.

As stated above, the early adopters will deploy applications with a clearly defined and “static” set of deliverables in areas that currently require the user to have network connectivity (i.e. do no worse than current, change the application delivery model from in-house to hosted). In parallel, corporations will begin to deploy private clouds for use within their firewalls.

As high bandwidth connectivity is more easily available adoption will increase, currently I think that is the real limitation.

Data Security will be built along the way, as will best practices on things like Escrow and mechanisms to migrate from one service provider to another.

Two other things that could kick cloud adoption into high gear are

the delivery of a cloud platform from a company like Akamai (why hasn’t this happened yet?)
a mechanism that would allow applications to scale based on load and use the right amount of cloud resource. Applications like web servers can scale based on client demand but this isn’t (yet) the case with other downstream services like databases or mail servers.

That’s my point of view, and I’d love to hear yours especially in the area of companies that are addressing the problem of providing a cloud user the ability to migrate from one provider to another, or mechanisms to dynamically scale services like databases and mail servers.

What’s next in tech? Boston, June 25th 2009

Recap of “What’s next in Tech”, Boston, June 25th 2009

What’s next in tech: Exploring the Growth Opportunities of 2009 and Beyond, June 25th 2009.

Scott Kirsner moderated two panels during the event; the first with three Veture Capitalists and the second with five entrepreneurs.

The first panel consisted of:

– Michael Greeley, General Partner, Flybridge Capital Partners

– Bijan Sabet, General Partner, Spark Capital

– Neil Sequiera, Managing Director, General Catalyst Partners

The second panel consisted of:

– Mike Dornbrook, COO, Harmonix Music Systems (makers of “Rock Band”)

– Helen Greiner, co-founder of iRobot Corp. and founder of The Droid Works

– Brian Halligan, social media expert and CEO of HubSpot

– Tim Healy, CEO of EnerNOC

– Ellen Rubin, Founder & VP/Product of CloudSwitch

Those who asked questions got a copy of Dan Bricklin’s book, Bricklin on Technology. There was also a DVD of some other book (I don’t know what it was) that was being given away.

Show of hands at the beginning was that 100% of the people were optimistic about the recovery and the future. Nice discussion though after leaving the session, I don’t get the feeling that anyone addressed the question “What’s next in tech” head on. Most of the conversation was about what has happened in tech and a lot of discussion about Twitter and Facebook. There were a lot of questions to the panel from the audience about what their companies were doing to help young entrepreneurs.And for an audience that was supposed to contain people interested in “what’s next”, no one wanted the DVD of the book, everyone wanted the paper copies. Hmm …

There has been a lot of interesting discussion about the topic and the event. One that caught my eye was Larry Cheng’s blog.

How many outstanding Shares in a start-up

Finding the number of outstanding shares in a start-up.

This question has come up quite often, most recently when I was having lunch with some co-workers last week.

While evaluating an offer at a start-up have you wondered whether your 10,000 shares was a significant chunk of money or a drop in the ocean?

Have you ever wondered how many shares are outstanding in a start-up you heard about?

This is a non-issue for a public company, the number of outstanding shares is a matter of public record. But not all start-up’s want to share this information with you.

I can’t say this for every state in the union but in MA, you can get a pretty good idea by looking at the Annual Report that every organization is required to file. It won’t be current information, the best you can do is get the most recent annual information and that means you can’t know anything about a very new company, but it is information that I would have liked to have had on some occasions in the past.

The link to search the Corporate Database is http://corp.sec.state.ma.us/corp/corpsearch/corpsearchinput.asp

Enter the name of the company and scroll to the bottom where you see a box that allows you to choose “Annual Reports”. Click that and you can read a recent Annual Report which will have the information you need.

If you find that this doesn’t work or is incomplete, please post your comments here. I’m sure others will appreciate the information.

And if you have information about other states, that is most welcome.

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Content Filtering

Multiple Networks!

Enabling OpenDNS on a Per-PC basis.

Other uses of OpenDNS

Other comments

My Point of View

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this: