Lazycoder

28 Apr 05

Size of scientific datasets

I deal mostly with medical data: some appointment data, but mostly data about a patient’s labs, medications, and what they are having taken out of or put into them. These databases can get pretty big. Right now I’ve got three main databases: one is about 1.5GB, the second is about 2.3GB, and the third, which is a staging database for the other two, is about 4GB. These aren’t very big compared to the multi-terabyte (a terabyte is 10^12 or 2^40 bytes) databases you’ll see at Fortune 500 companies or at Google.

But then you start to look at the data sets that supercolliders turn out. These supercolliders can churn out 10-petabyte (a petabyte is 10^15 or 2^50 bytes) datasets each time they run. Some astronomical data sets run into the 10PB/year range. None of these are in an RDBMS yet; no RDBMS can handle that much data all at once yet. Datasets in this range fascinate me because I can’t really visualize a multi-TB dataset, let alone a single petabyte. I know that some of the researchers here have multiple 100+GB Firewire external hard drives chained together holding DNA and protein sequence information. Usually it is spread across hundreds or thousands of Excel spreadsheets (groan).
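For a rough sense of scale, the figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is in Python and assumes the three database sizes quoted earlier are decimal megabytes and gigabytes (an assumption; the post does not say which convention applies). The helper names are made up for illustration.

```python
# Decimal (SI) versus binary readings of the prefixes used in the post.
TB_DECIMAL, TB_BINARY = 10**12, 2**40
PB_DECIMAL, PB_BINARY = 10**15, 2**50

MB = 10**6  # assumption: sizes in the post are decimal units


def binary_excess_percent(decimal_bytes, binary_bytes):
    """Return how much larger the binary unit is than the decimal one, in percent."""
    return (binary_bytes - decimal_bytes) / decimal_bytes * 100


def copies_that_fit(dataset_bytes, database_bytes):
    """Return how many whole copies of the databases fit inside the dataset."""
    return dataset_bytes // sum(database_bytes)


# The three databases from the post: about 1.5GB, 2.3GB and 4GB.
my_databases = [1_500 * MB, 2_300 * MB, 4_000 * MB]

# One 10PB supercollider run, read as decimal petabytes.
collider_run = 10 * PB_DECIMAL

if __name__ == "__main__":
    print(f"TB: binary is {binary_excess_percent(TB_DECIMAL, TB_BINARY):.2f}% larger")
    print(f"PB: binary is {binary_excess_percent(PB_DECIMAL, PB_BINARY):.2f}% larger")
    print(f"Copies of all three databases in one run: {copies_that_fit(collider_run, my_databases):,}")
```

Under those assumptions, a binary terabyte is about 10% bigger than a decimal one, a binary petabyte about 12.6% bigger, and a single 10PB run would hold roughly 1.28 million copies of all three databases combined.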

Check out this PowerPoint by Jim Gray of MS Research where he talks about some of the larger datasets. Interesting that Google was at 1.5PB as of spring 2001, WAY before the GMail initiative or Google Video upload. I wonder where they are now?

  • http://www.newestindustry.org/ Stephen Pierzchala

    I work for a company that inserts nearly 200 million rows of data PER DAY into an RDBMS. I used to work for a company that inserted 100 million rows of data per day.

    Only in the Terabytes…but still, it is very cool to have that level of data to analyze.

    smp

  • Scott

    Yeah. It’s nice, but it’s also kind of scary. Right now our DBs are small, but we are only selecting a very small portion of the data when we are importing. For example, in one database we are only pulling over 7 different types of labs that a patient can have. I’m not sure what the upper limit is, but I know we are only importing standard blood tests and a few cancer-specific hormone tests. Also, we’ve only been importing patients for about 8 months now. Since January we’ve added 2000+ patients in one database alone.

    Right now, if we need to change the schema we can just make the change. We back up the database first, of course, but before too long a full backup could take several hours, so these changes will need to be thought out more. If we ever get to the point where a backup takes a full 24 hours, we will need to rethink our backup strategy. Right now we do daily backups, keeping the previous day’s backup on the same hard drive and offloading the two-day-old backup to tape. The fact that researchers and physicians may want to use the data to actually TREAT people adds a new level of stress. Data integrity becomes a way of life.

  • http://haacked.com/ haacked (aka Phil)

    Imagine trying to load an ADO.NET DataSet with that. Your system would probably explode.

  • Scott

    Well, of course, then you bind the DataSet to an ASP.NET DataGrid with ViewState on.

    Man, I can’t even imagine trying “Select *” on a table with that much data.

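The jokes above about filling a DataSet or running “Select *” point at a real design choice: pulling an entire result set into memory versus reading it a batch at a time. Here is a minimal sketch of the batched approach. It uses Python and the standard-library sqlite3 module rather than ADO.NET, and the `labs` table, its columns, and the helper names are all hypothetical, invented only for this illustration.

```python
import sqlite3


def iter_rows(conn, query, batch_size=1000):
    """Yield rows one at a time while fetching them from the cursor in batches."""
    cur = conn.cursor()
    cur.execute(query)
    while True:
        batch = cur.fetchmany(batch_size)  # at most batch_size rows held at once
        if not batch:
            break
        yield from batch


def demo(row_count=10_000):
    """Build a small in-memory table and aggregate it without loading it whole."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, not the real one from the post.
    conn.execute("CREATE TABLE labs (patient_id INTEGER, test TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO labs VALUES (?, ?, ?)",
        ((i % 2000, "hypothetical_test", float(i)) for i in range(row_count)),
    )

    seen = 0
    total = 0.0
    query = "SELECT patient_id, test, value FROM labs"
    for _patient_id, _test, value in iter_rows(conn, query, batch_size=500):
        seen += 1
        total += value

    conn.close()
    return seen, total


if __name__ == "__main__":
    print(demo())
```

Memory use here is bounded by the batch size instead of the table size, which is roughly the trade-off a forward-only data reader offers over a fully loaded DataSet. Naming the columns instead of using `*` also keeps each row as narrow as the query actually needs.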