Size of scientific datasets
I deal mostly with medical data: some appointment data, but mostly data about a patient’s labs, medications, and what they are having taken out of or put into them. These databases can get pretty big. Right now I’ve got three main databases: one is about 1.5GB, the second is about 2.3GB, and the third, which is a staging database for the other two, is about 4GB. These aren’t very big compared to the multi-terabyte (10^12, or roughly 2^40, bytes) databases you’ll see at Fortune 500 companies or at Google.
But then you start to look at the data sets that supercolliders turn out. These supercolliders can churn out 10 petabyte (10^15, or roughly 2^50, bytes) datasets each time they run. Some astronomical data sets run into the 10PB/year range. None of these are in an RDBMS yet; no RDBMS can handle that much data all at once yet. Datasets in this range fascinate me because I can’t really visualize a multi-TB dataset, let alone a single petabyte. I know that some of the researchers here have multiple 100+GB Firewire external hard drives chained together holding DNA and protein sequence information, usually in hundreds and thousands of Excel spreadsheets (groan).
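To put those numbers side by side, here is a quick Python sketch of the arithmetic. It assumes the database sizes I quoted are decimal megabytes and gigabytes, and the variable and function names are just mine for illustration. It shows how far apart the decimal and binary readings of "terabyte" and "petabyte" are (roughly 10% and 13%), and how many copies of my three databases would fit into a single 10PB run.

```python
# Decimal (SI) and binary readings of the units mentioned in the post.
TB_DECIMAL = 10**12
TB_BINARY = 2**40
PB_DECIMAL = 10**15
PB_BINARY = 2**50


def binary_excess(decimal_size, binary_size):
    """Fraction by which the binary unit is larger than the decimal one."""
    return binary_size / decimal_size - 1


# My three databases, in decimal megabytes (assumption: 1 GB = 1000 MB).
database_sizes_mb = [1500, 2300, 4000]
total_bytes = sum(database_sizes_mb) * 10**6

# One 10 PB collider run, using the decimal petabyte.
collider_run_bytes = 10 * PB_DECIMAL

# How many times all three databases together fit into one run.
ratio = collider_run_bytes / total_bytes

print(f"TB binary vs decimal: {binary_excess(TB_DECIMAL, TB_BINARY):.1%} larger")
print(f"PB binary vs decimal: {binary_excess(PB_DECIMAL, PB_BINARY):.1%} larger")
print(f"My databases total {total_bytes / 10**9:.1f} GB")
print(f"One 10PB run is about {ratio:,.0f} times that")
```

So a single run is more than a million times the size of everything I work with day to day, which is probably why I can’t picture it.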
Check out this PowerPoint by Jim Gray of MS Research where he talks about some of the larger datasets. Interesting that Google was at 1.5PB as of spring 2001, WAY before the GMail initiative or Google Video upload. I wonder where they are now?


