Greg’s retelling of the recent Newsgator outage is almost a mirror image of my own experience last week, except on a much smaller scale.
We have several databases in a two-node failover cluster (more on the hardware issues later). Two of the databases are about 2GB and a third is around 5GB. No BLOBs, just a lot of rows. Two jobs run at 6AM that move data from the largest database, which we use as a staging database, into the two smaller databases (one for breast cancer and one for prostate cancer). A third job exports all of the data tables from the prostate database into an Access database that a research nurse uses to run queries for the researchers and clinicians.

I noticed that I didn’t receive the notice that morning informing me of the successful completion of the Access export at the usual time. I was putting out a lot of fires that morning, so I didn’t think anything of it. I figured the job was taking longer because we were adding a lot of data to one of the larger tables. Then I noticed an error in one of the data feeds between the staging database and our application database that caused over 4 million rows to be inserted; the job usually inserts somewhere between five thousand and fifteen thousand rows per day. I was able to identify the errant rows, so I ran a delete-where query to get rid of them. We have audit triggers set up on the table that run on insert, update, and delete, inserting a snapshot of the row into a mirror table. I, thinking myself to be a very clever boy, made sure to disable these triggers before running my delete. No need to fill up the audit tables with a mistake.
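In hindsight, a safer version of that cleanup would look something like the sketch below. The table, trigger, and column names (`dbo.Labs`, `trg_Labs_Audit`, `ImportBatchDate`) are made up for the example; the point is to disable the audit triggers and then delete in small chunks, so no single statement escalates to a table lock or balloons the log the way my one giant delete did:

```sql
-- Keep the mistake out of the audit mirror table
ALTER TABLE dbo.Labs DISABLE TRIGGER trg_Labs_Audit

-- Delete in batches of 10,000 instead of one multi-million-row statement
SET ROWCOUNT 10000
WHILE 1 = 1
BEGIN
    DELETE FROM dbo.Labs
    WHERE ImportBatchDate = '20050601'   -- the errant feed's rows
    IF @@ROWCOUNT = 0 BREAK
END
SET ROWCOUNT 0

ALTER TABLE dbo.Labs ENABLE TRIGGER trg_Labs_Audit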
Next thing I know, my phone is ringing. The system admin in charge of our database servers is asking me what the heck is going on. The server’s tempdb has grown from 6MB to 14GB. Then the user emails start (we only have 5 users for this application, so it’s not too bad). “Weird errors,” they say. They send me a snapshot and I see the SQL connection is timing out. Hmmmmm. We check the running processes on the server and notice that there is a table lock on the table I’m deleting the rows from. That’s not good. OK, we’ll kill that process. Shrinking the tempdb requires a service restart, so we schedule that.
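For the curious, finding and killing the blocker goes roughly like this (63 is a made-up SPID for the example):

```sql
-- sp_who2 lists every session; the BlkBy column shows which SPID
-- holds the lock a waiting session is blocked behind.
EXEC sp_who2

-- See what the offending session is actually running
DBCC INPUTBUFFER (63)

-- Last resort: kill it. Note the delete then ROLLS BACK, which can
-- take as long as the delete had already been running.
KILL 63
```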
I did some poking around and found out how this whole mess started. It began with the daily feed into the largest table we have, Labs. Over 2 million rows. During the daily feed, we use a compound key to determine uniqueness for a row. The compound key is made up of two int sequence numbers that are assigned by the institution feeding us the data. Everything was fine for a while, but then some duplicate rows started showing up in our Labs table. It turns out that the source of our feed is actually a two-node cluster, each node of which receives inserts separately. From time to time, the two servers would synchronize the two int keys using two OTHER “highly unique” keys that they were not feeding to us. This meant that, from time to time, a row we had already imported would have its sequence numbers reassigned to new ones we didn’t have in our system, causing it to appear to be a new row to us. We would then insert the row and end up with a duplicate. Sometimes a new row in the source system would get an old row’s sequence numbers, causing us to update our existing row with the new row’s information, meaning that we lost the old row’s information UNLESS it was assigned a new set of sequence numbers. Shortly after that, we tried to create our own compound key using 4 columns that we thought would uniquely identify a Lab row. Since the columns in question were changing due to the sequence number changes, we would occasionally get a cartesian product(2) when performing a join with the source table to determine whether or not a row should be imported. We’ve abandoned trying to update the Labs table from the source until we get the new “highly unique” keys.
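The import’s uniqueness check amounts to something like the following (table and column names are illustrative, not our actual schema). It shows exactly why a re-synched sequence number looks like a brand-new row:

```sql
-- Rows are matched on the institution's two sequence numbers. If the
-- source re-syncs those numbers, an already-imported row stops matching
-- the LEFT JOIN below and gets inserted a second time: a duplicate.
INSERT INTO dbo.Labs (SeqNo1, SeqNo2, PatientId, LabValue)
SELECT s.SeqNo1, s.SeqNo2, s.PatientId, s.LabValue
FROM staging.Labs AS s
LEFT JOIN dbo.Labs AS t
    ON  t.SeqNo1 = s.SeqNo1
    AND t.SeqNo2 = s.SeqNo2
WHERE t.SeqNo1 IS NULL   -- looks "new", even if we already have the row
```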
Of course, our source never told us this was happening. We considered re-pulling all of the data in the Labs table, but we calculated that by the time we finished pulling all of the data again(1), the sequence numbers would have been re-synched and would have changed for some of the rows. Nifty, huh? This fault in the system is the basis of the two actions that caused our DB server to go whacko: the deletion of 8 million rows from the Labs table, and the Access export trying to move 8 million rows into an Access DB at the same time.
So what have we done to prevent this from happening in the future? Well, the first question I asked after the system admin said the tempdb had grown to 14GB was, “How big is the drive?” “18GB.” So we’re increasing the size of the drive the system DB tables live on. (The data files are on a 73GB drive, more than enough space for us for now.) Then I wondered why it was using so much of the tempdb instead of RAM. “How much RAM does the server have?” “1 GB of DDR RAM.” “(…)”. I think 1 GB of RAM is a bit light for a production database server, especially one hosting multi-gigabyte databases that are only going to grow as time goes on. So we’re going to find out how much RAM the servers can handle and get as close to the maximum as our budget will allow. Heck, my dev laptop at work has a faster processor than our production DB server and 1GB of RAM. Maybe we should host our DB off my laptop? (“Why is our app down?” “Scott’s playing Starcraft on the DB server.”) The second thing we did was request a snapshot of ALL of the Labs that we should have in our database, with the new “highly unique” keys included, so we can re-import the labs. We’ve also contracted a couple of true DBAs to come in and performance test and tune our databases(3), as well as to create some automatic metrics and a dashboard-style application for our import process, letting us see whether the number of insertions looks right w/o having to query the tables directly.
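The dashboard metrics boil down to queries like this one. I’m assuming an `InsertedOn`-style audit column here for the sake of the example; the real schema may track it differently:

```sql
-- Rows inserted per day into Labs. 5K-15K is a normal day; anything
-- in the millions means the feed has gone sideways again.
SELECT CONVERT(char(10), InsertedOn, 120) AS ImportDay,
       COUNT(*) AS RowsInserted
FROM dbo.Labs
GROUP BY CONVERT(char(10), InsertedOn, 120)
ORDER BY ImportDay DESC
```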
1): The mechanism our source provided for us to pull the data across is a REST-like web service that returns XML docs based on arguments we pass to the service. It’s very, very slow. So slow it requires two “very”s to describe how slow it is. Normally we are only pulling small amounts of data across, and it’s fast enough to meet our needs. Pulling 4 million rows across this service would probably take about 3-5 days, assuming it didn’t time out.
2): A cartesian product is basically the result of a join that returns every possible combination of rows between two tables. In this case we had additional constraints in the where clause which prevented a true cartesian product of the two tables from being returned; it just returned the cartesian result of the rows defined by the where clause. Which meant one day we would see 9000+ rows inserted, the next day 100K+, then 50K, then 1 million.
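A tiny self-contained demonstration of the effect: when the columns you join on stop being unique, matching rows multiply instead of pairing off one-to-one.

```sql
DECLARE @src TABLE (k int, v char(1))
DECLARE @dst TABLE (k int, v char(1))

-- k is supposed to identify a row uniquely, but each table has it twice
INSERT @src VALUES (1, 'a')
INSERT @src VALUES (1, 'b')
INSERT @dst VALUES (1, 'c')
INSERT @dst VALUES (1, 'd')

-- 2 rows x 2 rows = 4 rows back, not 2
SELECT s.v, d.v
FROM @src s JOIN @dst d ON d.k = s.k
```

Scale that multiplication up to a 2-million-row table whose “unique” columns keep shifting under you, and the 9K/100K/1M daily swings make sense.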
3): If we had had a few indexes in place, we wouldn’t have used as much tempdb as we did. I’ve since added some indexes to the tables in question, which reduced the number of table scans our queries have to do to zero. But I’m sure more could be done.
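The indexes in question are plain nonclustered indexes on the columns the import joins on, along the lines of the following (names illustrative):

```sql
-- Covers the compound key the daily feed joins on, so the 2M+ row
-- Labs table gets index seeks instead of repeated full table scans.
CREATE NONCLUSTERED INDEX IX_Labs_SeqNo1_SeqNo2
    ON dbo.Labs (SeqNo1, SeqNo2)
```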