Multithreaded File I/O (Reflections on Dr. Dobb’s article by Stefan Wörthmüller)

Thoughts on the results that Stefan Wörthmüller reports in his article on Dr. Dobb’s Journal.

I ran across an interesting article on Multi-Threaded File I/O in Dr. Dobb’s today. You can read the article at http://www.ddj.com/hpc-high-performance-computing/220300055

I was particularly intrigued by the statements on variability:

I repeated the entire test suite three times. The values I present here are the average of the three runs. The standard deviation in most cases did not exceed 10-20%. All tests have been also run three times with reboots after every run, so that no file was accessed from cache.

Initially, I thought 10-20% was a bit much; this seemed like a relatively straightforward test and variability should be low. Then I looked at the source code for the test and I’m now even more puzzled about the variability.

Get a copy of the sources here. It is a single source file, and the only randomization in it is a call to rand() to pick a location in the file.

The code to do the random seek is below

   if(RandomCount)
   {
      // Seek new position for Random access
      if(i >= maxCount)
         break;
      long pos = (rand() * fileSize) / RAND_MAX - BlockSize;
      fseek(file, pos, SEEK_SET);
   }

While this is a multi-threaded program, I see no calls to srand() anywhere in the program. Just to be sure, I modified Stefan’s program as attached here. (My apologies, the file has an extension of .jpg because I can’t upload a .cpp or .zip onto this free wordpress blog. The file is a Windows ZIP file, just rename it).

///////////////////////////////////////////////////////////////////////////////
// mtRandom.cpp   Amrith Kumar 2009 (amrith (dot) kumar (at) gmail (dot) com)
// This program is adapted from the program FileReadThreads.cpp by Stefan Woerthmueller
// No rights reserved. Feel Free to do what ever you like with this code
// but don't blame me if the world comes to an end.

#include "Windows.h"
#include "stdio.h"
#include "conio.h"
#include <stdlib.h>   // rand()

///////////////////////////////////////////////////////////////////////////////
// Worker Thread Function
///////////////////////////////////////////////////////////////////////////////

DWORD WINAPI threadEntry(LPVOID lpThreadParameter)
{
    // The thread index is passed by value in the pointer-sized parameter.
    int index = (int)(INT_PTR)lpThreadParameter;
    FILE * fp;
    char filename[32];

    sprintf ( filename, "file-%d.txt", index );

    fprintf ( stderr, "Thread %d started\n", index );
    if ((fp = fopen ( filename, "w" )) == (FILE * ) NULL)
    {
        fprintf (stderr, "Error opening file %s\n", filename );
    }
    else
    {
        // No srand() anywhere: record the first ten values rand() returns in this thread.
        for (int i = 0; i < 10; i ++)
        {
            fprintf ( fp, "%d\n", rand());
        }

        fclose (fp);
    }

    fprintf ( stderr, "Thread %d done\n", index );

    return 0;
}

#define MAX_THREADS (5)

int main(int argc, char* argv[])

{
    HANDLE h_workThread[MAX_THREADS];

    for(int i = 0; i < MAX_THREADS; i++)
    {
        h_workThread[i] = CreateThread(NULL, 0, threadEntry, (LPVOID)(INT_PTR) i, 0, NULL );
        Sleep(1000);
    }

    WaitForMultipleObjects(MAX_THREADS, h_workThread, TRUE, INFINITE);
    printf ( "All done. Good bye\n" );
    return 0;
}

So, I confirmed that Stefan will be getting the same sequence of values from rand() over and over again, across reboots.
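The reason is in the C standard: a program that calls rand() without ever calling srand() behaves as if it had called srand(1), so it replays the same sequence on every run. Here is a minimal, portable sketch of that check. The helper names firstValues and matchesDefaultSeed are made up for illustration and are not part of Stefan’s program or of mtRandom.cpp.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper: collect the next n values from rand().
std::vector<int> firstValues(int n)
{
    std::vector<int> values;
    for (int i = 0; i < n; ++i)
        values.push_back(std::rand());
    return values;
}

// Returns true when the sequence rand() yields after srand(1)
// is identical to the sequence passed in as `reference`.
bool matchesDefaultSeed(const std::vector<int>& reference)
{
    std::srand(1);  // the seed an unseeded program behaves as if it had used
    return firstValues(static_cast<int>(reference.size())) == reference;
}
```

Calling matchesDefaultSeed(firstValues(10)) at the very start of a program, before anything else has touched rand(), should return true on a conforming implementation. Microsoft documents srand() as setting the starting point for the current thread, so on that runtime I would expect each worker thread to replay the same sequence as well.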

Why then is he still seeing 10-20% variability? Beats me, something smells here … I would assume that from run to run, there should be very little variability.

Thoughts?

From the “way-back machine”

We’ve all heard the expression “way-back machine” and some of us know about tools like the Time Machine. But, did you know that there is in fact a “way-back machine” ?

From time to time, I have used this service and it is one of those corners of the web that is nice to know about. I was reminded of it this morning in a conversation, and that led to a pleasant walk through history.

If you aren’t familiar with the “way-back machine”, take a look at http://www.archive.org/web/web.php

Some day you may wonder what a web page looked like a while ago and the “way-back machine” is your solution.

Here are some interesting ones that I looked at today. The Time Magazine in February 1999.

Time magazine web page in February 1999

Ever wondered what the Dataupia web page looked like in February 2006? I know someone who would get a kick out of it so I went and looked it up.

The Dataupia web page from February 2006

Check it out sometime; the way-back machine is a wonderful afternoon diversion.

The “way back” archive is not complete, alas!

Florida recounts

Diluting education standards in Kansas (part II)

Coming in the aftermath of the efforts to outlaw the teaching of evolution in the state, this story about Kansas is unfortunate.

http://blog.acm.org/archives/csta/2009/09/post_4.html

http://usacm.acm.org/usacm/weblog/index.php?p=741

The state has significant employment problems, and the recent downturn in the economy has hit the aircraft industry in the state hard. With a nascent IT start-up scene there, this is probably the worst publicity that the state could have hoped for.

Who are you, really? The value of incorrect response in challenge-response style authentication.

We all know how service providers validate the identity of callers. But, how do you validate the identity of the service provider on the other end of the telephone? In the area of computer security, the inexact challenge response mechanism is a useful way of validating identities; a wrong answer and the response to a wrong answer tell a lot.

Service providers (electricity, cable, wireless phone, POTS telephone, newspaper, banks, credit card companies) are regularly faced with the challenge of identifying and validating the identity of the individual who has called customer service. They have come up with elaborate schemes involving the last four digits of your social security number, your mailing address, your mother’s maiden name, your date of birth and so on. The risks associated with all of these have been discussed at great length elsewhere; social security numbers are guessable (see “Predicting Social Security Numbers from Public Data”, Acquisti and Gross), mailing addresses can be stolen, mother’s maiden names can be obtained (and in some Latin American countries your mother’s maiden name is part of your name) and people hand out their dates of birth on social networking sites without a problem!

Bogus Parking ticket

So, apart from identity theft by someone guessing at your identity, we also have identity theft because people give out critical information about themselves. Phishing attacks are well documented, and we have heard of the viruses that have spread based on fake parking tickets.

Privacy and Information Security experts caution you against giving out key information to strangers; very sound advice. But, how do you know who you are talking to?

Consider these two examples of things that have happened to me.

1. I receive a telephone call from a person who identifies himself as being an investment advisor from a financial services company where I have an account. He informs me that I am eligible for a certain service that I am not utilizing and he would like to offer me that service. I am interested in this service and I ask him to tell me more. In order to tell me more, he asks me to verify my identity. He wants the usual four things and I ask him to verify in some way that he is in fact who he claims to be. With righteous indignation he informs me that he cannot reveal any account information until I can prove that I am who I claim to be. Of course, that sets me off and I tell him that I would happily identify myself to be who he thinks I am, if he can identify that he is in fact who he claims to be. Needless to say, he did not sell me the service that he wanted to.

2. I call a service provider because I want to make some change to my account. They have “upgraded their systems” and, having looked up my account number and having “matched my phone number to the account”, they put me through to a real live person. After discussing how we will make the change that I want, the person then asks me to provide my address. Ok, now I wonder why that would be? Don’t they have my address? Surely they’ve managed to send me a bill every month.

“For your protection, we need to validate four pieces of information about you before we can proceed”, I am told.

The four items are my address, my date of birth, the last four digits of my social security number and the “name on the account”.

Of course, I ask the nice person to validate something (for example, tell me how much my last bill was) before I proceed. I am told that for my own protection, they cannot do that.

Computer scientists have developed several techniques that provide “challenge-response” style authentication, where both parties can convince themselves that the other is who they claim to be. For example, public-key/private-key encryption provides a simple way to do this. Either party can generate a random string and give it to the other, asking them to encrypt it using the key that they hold. The encrypted response is returned to the sender, and that is sufficient to guarantee that the peer does in fact possess the appropriate “token”.

In the context of a service provider and a customer, there would be a mechanism for the service provider to verify that the “alleged customer” is in fact the customer who he or she claims to be but the customer also verifies that the provider is in fact the real thing.
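To make the exchange concrete, here is a toy sketch of one direction of it, using textbook RSA with deliberately tiny numbers. Everything in it is an assumption made for illustration: the key values (n = 61 × 53 = 3233, e = 17, d = 2753) are the classic classroom example, the function names are made up, and nothing this small is remotely secure.

```cpp
#include <cstdint>

// Toy RSA key pair: illustrative only, NOT secure.
const std::uint64_t kModulus = 3233;          // n = 61 * 53
const std::uint64_t kPublicExponent = 17;     // e, known to everyone
const std::uint64_t kPrivateExponent = 2753;  // d, known only to the prover

// Square-and-multiply modular exponentiation.
std::uint64_t modPow(std::uint64_t base, std::uint64_t exp, std::uint64_t mod)
{
    std::uint64_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

// Prover: transforms the challenge with the private key.
std::uint64_t respond(std::uint64_t challenge)
{
    return modPow(challenge, kPrivateExponent, kModulus);
}

// Verifier: undoes the transformation with the public key and
// checks that the original challenge comes back.
bool verify(std::uint64_t challenge, std::uint64_t response)
{
    return modPow(response, kPublicExponent, kModulus) == challenge % kModulus;
}
```

The verifier picks a fresh random challenge smaller than the modulus, sends it, and accepts only if verify(challenge, response) is true. For the mutual case described above, each side would hold its own key pair and play both roles. A real system would use a vetted cryptographic library rather than anything like this.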

The risks in the first scenario are absolutely obvious; I recently received a text message (vector) that read

“MsgID4_X6V…@v.w RTN FCU Alert: Your CARD has been DEACTIVATED. Please contact us at 978-596-0795 to REACTIVATE your CARD. CB: 978-596-0795”

A quick web search does in fact show that this is a phishing event. Whether anyone has tracked that phone number down and found out whether it belongs to a poor unsuspecting victim or to a perpetrator, I am not sure.

But, what does one do when in fact they receive an email or a phone call from a vendor with whom they have a relationship?

One could contact a psychic to find out if it is authentic; check the New England SEERs, for example.

http://twitter.com/ILNorg/status/3786484194

http://twitter.com/NewEnglandSEERs

RT @Lucy_Diamond 978-596-0795 do not return call on text. Call police or your real bank. Caution bank fraud. Never give your pin to anyone

RT @Lucy_Diamond Warning bank scam via cell phone text remember never give your pin number to anyone. Your bank won’t ask you they know it

Agent: For your security please verify some information about your account. What is your account number

Me: Provide my account number

Agent: Thank you, could you give me your passphrase?

Me: ketchup

Agent: Thank you. Could you give me your mother’s maiden name

Me: Hoover Decker

Agent: Thank you. and the last four digits of your SSN

Me: 2004

Agent: Just one more thing, your date of birth please

Me: February 14th 1942

Agent: Thank you

Agent: For your security please verify some information about your account. What is your account number

Me: Provide my account number

Agent: Thank you, could you give me your passphrase?

Me: ketchup

Agent: That’s not what I have on the account

Me: Really, let me look for a second. What about campbell?

Agent: No, that’s not it either. It looks like you chose something else, but similar.

Me: Oh, of course, Heinz58. Sorry about that

Agent: That’s right, how about your mother’s maiden name.

Me: Hoover Decker

Agent: No, that’s not it.

Me: Sorry, Hoover Bissel

Agent: That’s right. And the last four of your social please

Me: 2007

Agent: thank you, and the date of birth

Me: Feb 29, 1946

Agent: Thank you
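The difference between an agent who says “Thank you” to anything and one who pushes back on a wrong answer can be written down as a small probe. This is only a sketch of the idea: the Agent type stands in for whoever is on the other end of the line, and all the names are made up for illustration.

```cpp
#include <functional>
#include <string>

// What a caller can conclude after probing the other party.
enum class Verdict { KnowsSecret, AcceptsAnything, RejectsEverything };

// Hypothetical stand-in for the agent: given an answer, reports whether it was accepted.
using Agent = std::function<bool(const std::string&)>;

// Offer a deliberately wrong answer first, then the right one.
Verdict probe(const Agent& agent,
              const std::string& wrongAnswer,
              const std::string& rightAnswer)
{
    if (agent(wrongAnswer))
        return Verdict::AcceptsAnything;   // the decoy was accepted: no evidence they hold my record
    if (agent(rightAnswer))
        return Verdict::KnowsSecret;       // decoy rejected, real value accepted
    return Verdict::RejectsEverything;     // inconclusive: their record may be stale or different
}
```

With “ketchup” as the decoy and “Heinz58” as the real passphrase, an agent who accepts the decoy earns AcceptsAnything, and one who rejects it but accepts the real value earns KnowsSecret. The probe costs the caller nothing, because the wrong answer gives nothing away.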

Agent: For your security please verify some information about your account. What is your account number

Me: Provide my account number

Agent: Thank you, could you give me your passphrase?

Me: ketchup

Agent: Thank you. Could you give me your mother’s maiden name

Me: Hoover Decker

Agent: Thank you. and the last four digits of your SSN

Me: 2004

Agent: Just one more thing, your date of birth please

Me: February 14th 1942

Agent: Thank you. Could you verify the address to which you would like us to ship the package.

(At this point, I’m very puzzled and not really sure what is going on)

Me: Provided my real address (say 10 Any Drive, Somecity, 34567)

Agent: I’m sorry, I don’t see that address on the account, I have a different address.

Me: What address do you have?

Agent: I have 14 Someother Drive, Anothercity, 36789.

The address the agent provided was in fact a previous location where I had lived.

What has happened is that the cable company (like many other companies these days) has outsourced the fulfillment of the orders related to this service. In reality, all they want is to verify that the account number and the address match! How they had an old address, I cannot imagine. But, if the address had matched, they would have mailed a little package out to me (it was at no charge anyway) and no one would be any the wiser.

But, I hung up and called the cable company on the phone number on my bill and got the full fourth-degree. And they wanted to talk to “the account owner”. But, I had forgotten what I told them my SSN was … Ironically, they went right along to the next question and later told me what the last four digits of my SSN were 🙂

Someone said they were interested in the security and privacy of my personal information?

We people born on the 29th of February 1946 are very skeptical.

Faster or Free

I don’t know how Bruce Scott’s article showed up in my mailbox but I’m confused by it (happens a lot these days).

I agree with him that too much has been made of whether a system is a columnar system, a truly columnar system, or a vertically partitioned row store; what really matters to a customer is TCO and price-performance in their own environment. Bruce says as much in his blog post:

Let’s start talking about what customers really care about: price-performance and low cost of ownership. Customers want to do more with less. They want less initial cost and less ongoing cost.

Then, he goes on to say

On this last point, we have found that we always outperform our competitors in customer created benchmarks, especially when there are previously unforeseen queries. Due to customer confidentiality this can appear to be a hollow claim that we cannot always publicly back up with customer testimonials. Because of this, we’ve decided to put our money where our mouth is in our “Faster or Free” offer. Check out our website for details here: http://www.paraccel.com/faster_or_free.php

So, I went and looked at that link. There, it says:

Our promise: The ParAccel Analytic Database™ will be faster than your current database management system or any that you are currently evaluating, or our software license is free (Maintenance is not included. Requires an executed evaluation agreement.)

To be consistent, should that not make the promise that the ParAccel offering would provide better price-performance and lower TCO than the current system or the one being evaluated? After all, that is what customers really care about.
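The gap between “faster” and “better price-performance” is easy to show with arithmetic. The sketch below uses made-up function names, and the cost and throughput figures in the note after it are hypothetical; none of them come from ParAccel or any customer benchmark.

```cpp
// Cost per unit of work delivered; lower is better.
double pricePerformance(double totalCost, double queriesPerHour)
{
    return totalCost / queriesPerHour;
}

// True when system A is faster than system B, yet each unit of work
// on A costs more than it does on B.
bool fasterButWorseValue(double costA, double queriesPerHourA,
                         double costB, double queriesPerHourB)
{
    return queriesPerHourA > queriesPerHourB &&
           pricePerformance(costA, queriesPerHourA) >
               pricePerformance(costB, queriesPerHourB);
}
```

Suppose A costs $300,000 and runs 120 queries per hour, while B costs $100,000 and runs 100. A is 20% faster, but A works out to $2,500 per query-per-hour against B’s $1,000, so fasterButWorseValue(300000, 120, 100000, 100) returns true. A “faster or free” promise would be satisfied by A even though B is the better buy on the measure Bruce says customers care about.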

I’m confused. More coffee!

Oh, there’s more! Check out this link http://www.paraccel.com/cash_for_clunkers.php

Talk about fine print:

* Trade-in value is equivalent to the first year free of a three year subscription contract based on an annual subscription rate of $15K/user terabyte of data. Servers are purchased separately.
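Taking that fine print at face value, the arithmetic works out as follows. This sketch assumes the subscription scales linearly with user terabytes and leaves out servers, maintenance and anything else the offer does not mention. The function names are made up for illustration.

```cpp
// Figures quoted in the fine print.
const long long kAnnualRatePerUserTB = 15000;  // dollars per user terabyte per year
const int kContractYears = 3;

// Subscription cost over the full contract, before the trade-in.
long long listPrice(long long userTB)
{
    return kAnnualRatePerUserTB * userTB * kContractYears;
}

// "First year free": one year of subscription.
long long tradeInValue(long long userTB)
{
    return kAnnualRatePerUserTB * userTB;
}

// What is still owed for software over the three years.
long long netSoftwareCost(long long userTB)
{
    return listPrice(userTB) - tradeInValue(userTB);
}
```

For a hypothetical 10 user terabytes, the trade-in is worth $150,000 against a three-year list price of $450,000, which leaves $300,000 of subscription to pay before any hardware is bought.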

Desktop Email Client vs. GMail: Why desktop mail clients are still better than the GMail interface

The GMail user interface, while very good and much better than some of the others, lacks some useful functionality that it would need to be a complete replacement for a desktop email client like Outlook.

Joe Kissell writes in CIO magazine about the six reasons why desktop email clients still rule. He opines that he would take a desktop email client any day and provides the following reason, and six more:

Well, there is the issue of outages like the one Gmail experienced this week. I like to be able to access my e-mail whenever I want. But beyond that, webmail still lags far behind desktop clients in several key areas.

Much has been written by many on this subject. As long ago as 2005, Cedric pronounced his verdict. Brad Shorr had a more measured comparison that I read before I made the switch about a month ago. Lifehacker pronounced the definitive comparison (IMHO it fell flat, their verdicts were shallow). Rakesh Agarwal presented some good insights and suggestions.

I read all of this and tried to decide what to do about a month ago. Here is a description of my usage.

My Usage

1. Email Accounts

I have about a dozen. Some are through GMail, some are on domains that I own, one is at Yahoo, one at Hotmail and then there are a smattering of others like aol.com, ZoHo and mail.com. While employed (currently a free agent) I have always had an Exchange Server to look at as well.

2. Email volume

Excluding work related email, I receive about 20 or 30 messages a day (after eliminating SPAM).

3. Contacts

I have about 1200 contacts in my address book.

4. Mobile device

I have a Windows Mobile based phone and I use it for calendaring, email and as a telephone. I like to keep my complete contact list on my phone.

5. Access to Email

I am NOT a Power-User who keeps reading email all the time (there are some who will challenge this). If I can read my email on my phone, I’m usually happy. But, I prefer a big screen view when possible.

6. Instant messaging

I like to use instant messengers. Since I have accounts on AOL IM, Y!, HotMail and Google, I use a single application that supports all the flavors of IM.

Seems simple enough, right? Think again. Here is why, after migrating entirely to GMail, I have switched back to a desktop client.

The Problem

1. Google Calendar and Contact Synchronization is a load of crap.

Google does some things well. GMail (the mail and parts of the interface) is one of these things. They support POP and IMAP, they support consolidation of accounts through POP or IMAP, and they allow email to be sent as if from another account. They are far ahead of the rest. With Google Labs you can get a pretty slick interface. But, Calendar and Contact Synchronization really suck.

For example, I start off with 1200 contacts and synchronize my mobile device with Google. How do I do it? By creating an Exchange Server called m.google.com and having it provide Calendar and Contacts. You can read about that here. After synchronizing the two, I had 1240 or so contacts on my phone. Ok, I had sent email to 40 people through GMail who were not in my address book. Great!

Then I changed one person’s email address and the wheels came off the train. It tried to synchronize everything and ended up with some errors.

I started off with about 120 entries in my calendar; after synchronizing every hour, I now have 270 or so. Each time it felt that contacts had been changed, it refreshed them, and I now have seventeen appointments tomorrow saying it is someone’s birthday. Really, do I get extra cake or something?

2. Google Chat and Contact Synchronization don’t work well together.

After synchronizing contacts my Google Chat went to hell in a hand-basket. There’s no way to tell why, I just don’t see anyone in my Google Chat window any more.

Google does some things well. The GMail server side is one of them. As Bing points out, Google Search returns tons of crap (not that Bing does much better). Calendar, Contacts and Chat are still not in the “does well” category.

So, it is back to Outlook Calendar and Contacts and POP Email. I will get all the email to land in my GMail account though, nice backup and all that. But GMail Web interface, bye-bye. Outlook 2007 here I come, again.

The best of both worlds

The stable interface between a phone and Outlook, a stable calendar, contacts and email interface (hangs from time to time but the button still works), and a nice online backup at Google. And, if I’m at a PC other than my own, the web interface works in a pinch.

POP all mail from a variety of accounts into one GMail account and access just that one account from both the telephone and the desktop client. And install that IM client application again.

What do I lose? The threaded message format that GMail has (that I don’t like). Yippie!

ParAccel TPC-H 30TB results challenged!

Watch the feeding frenzy now that ParAccel’s TPC-H 30TB results have been challenged.

Before I had my morning cup of coffee, I found an email message with the subject “ParAccel ADVISORY” sitting in my mail box. Now, I’m not exactly sure why I got this message from Renee Deger of GlobalFluency so my first suspicion was that this was a scam and that someone in Nigeria would be giving me a million dollars if I did something.

But, I was disappointed. Renee Deger is not a Nigerian bank tycoon who will make me rich. In fact, ParAccel’s own blog indicates that their 30TB results have been challenged.

We wanted you to hear it from us first.  Our TPC-H Benchmark for performance and price-performance at the 30-terabyte scale was taken down following a challenge by one of our competitors and a review by the TPC.  We executed this benchmark in collaboration with Sun and under the watch of a highly qualified and experienced auditor.   Although it has been around for many years, the TPC-H specification is still subject to interpretation, and our interpretation of some of the points raised was not accepted by the TPC review board.

None of these items impacts our actual performance, which is still the best in the industry.  We will rerun the benchmark at our earliest opportunity according to the interpretations set forth by the TPC review board. We remain committed to the organization, as outlined in a blog post by Barry Zane, our CTO, here: http://paraccel.com/data_warehouse_blog/?p=74#more-74.

Please see the company blog for a post by David Steinhoff in the office of the CTO for further info: http://paraccel.com/data_warehouse_blog/?p=104#more-104

I read David Steinhoff’s blog as well. He writes

This last week, our June 2009 30TB results were challenged and found to be in violation of certain TPC rules. Accordingly, our results will soon be taken down from the TPC website.

We published our results in good faith. We used our standard, customer available database to run the benchmark (we wanted the benchmark to reflect the incredible performance our customers receive). However, the body of TPC rules is complex and undergoes constant interpretation; we are still relatively new to the benchmark game and are still learning, and we made some mistakes.

While we cannot get into the details of the challenges to our results (TPC proceedings are confidential and we would be in violation if we did), we can state with confidence that our query performance was in no way enhanced by the items that were challenged.

We can also say with confidence that we will publish again, soon.

Now, some competitor or competitor loyalist may try to make more of this than there is … we all know there is the risk of tabloid frenzy around any adversity in a society with free speech … and we wouldn’t have it any other way.

It is unfortunate that the proceedings are confidential and cannot be shared. I hope you republish your results at 30TB.

Contrary to a long list of pundits, I believe that these benchmarks have an important place in the marketing of a product and its capabilities.

I reiterate what I said in a previous blog post

ParAccel’s solution is based on high-performance trilithium crystals. (Note: I don’t know why this wasn’t disclosed in the full disclosure report).

I hope the challenge was not about the trilithium crystals and the fact that you didn’t disclose it in the full disclosure report.