Category Archives: Database

RDBMS, ODBMS, and their related technogies.

Mocking IDataReader using Rhino.Mocks 3.5

The other day I was writing unit tests for some legacy code and I needed to mock IDataReader. I really just wanted to populate the reader with a single row of data, then the Read() method should return false. Using Rhino.Mocks it was a piece of cake.

            IDataReader reader = MockRepository.GenerateStub<IDataReader>();

            reader.Stub(x => x.Read()).Return(true).Repeat.Times(1);

            reader.Stub(x => x.Read()).Return(false);

            reader.Stub(x => x["ID"]).Return(Guid.Empty);

            reader.Stub(x => x["FullName"]).Return("Test User");


            List<UserDTO> list = SearchProvider.ParseUserData(reader);


The magic happens in the Repeat.Times(1) statement. This tells Rhino.Mocks that when the method is called, the mock should Return the given value that number of times. After that it can return a different value.

Update: Jeremy Miller told me about the DataTableReader class in the BCL and that you can use it to stub out IDataReader. I like this better because it removes the dependency the test had on Rhino.Mocks.

            DataTable table = new DataTable();
            DataRow row = table.NewRow();
            table.Columns.Add(new DataColumn("ID"));
            table.Columns.Add(new DataColumn("FullName"));
            row["DirectoryUserID"] = Guid.Empty;
            row["FullName"] = "Test User";
            DataTableReader reader = new DataTableReader(table);

Dealing with comma delimited data in a database column

For various reasons, developers through the years have decided to store lists of data as comma-separated lists. Most programming languages include a split() function that allows you to break apart a list of data using a specified character. T-SQL does not.

I don’t remember where I got this split function from. I know I didn’t write all of it from scratch. But basically what it will do is take a delimited list of data, split it apart, and return a table where each data element is associated with the identity value you pass into the function. This is very useful if you just want to get the values for a single row.

CREATE  function Split
	@list varchar(8000),
	@identityVal int,
	@SplitChar char(1)
RETURNS @RetVal table
	id int,
	Value nvarchar(1000)
	WHILE (Charindex(@SplitChar,@list)>0)	
		Insert Into @RetVal (id,Value)
		Select @identityVal,value = (ltrim(rtrim(Substring(@list,1,charindex(@SplitChar,@list)-1))))
		SET @list = Substring(@List,Charindex(@SplitChar,@list)+len(@SplitChar),len(@list))
	INSERT INTO @RetVal (id,Value)
	SELECT @identityVal, ltrim(rtrim(@list))

So what happens if you are trying to build a view or a report and need to split apart multiple rows. You can iterate over the rows in the table you want to flatten out and call the split function for each row, passing the rows PK into the split function.

create function fnFlattenMyTable
RETURNS @RetVal table 
	PrimaryKeyId int, 
	MyListValue varchar(100)
Declare @PrimaryKeyId int
Declare @MyListColumn varchar(6000)

SET @PrimaryKeyId = (select MIN(PrimaryKeyId ) FROM MyTable)
WHILE @PrimaryKeyId  is not null --main loop through MyTable
	Set @MyListColumn = (select MyListColumn from MyTable where PrimaryKeyId  = @PrimaryKeyId )
	insert into @RetVal(PrimaryKeyId , MyListValue )
	select id, value from [Split](@MyListColumn , @PrimaryKeyId , ',')
	SET @PrimaryKeyId  = (select MIN(PrimaryKeyId ) FROM MyTable where PrimaryKeyId  > @PrimaryKeyId )


This function uses one of my favorite sql tricks. It uses a while loop to avoid having to declare a cursor . The logic is pretty simple, on each loop, select the lowest value of the PK in the table that is higher than the previous PK. That’s the roundabout way of saying, “Select the next highest PK from the table”.

Right now, this function is hard coded to flatten the table “MyTable”, but it could be re-factored into a general use function.

Re-associate SQL users with logins

I’ve been getting our test environment in sync with our production environment the past week. Part of this effort involves moving database from our production environment to our test database server. The easiest way to do this is to put the production server into single-user mode, copy the .MDF and .LDF database files over to the test database server, and the attach the database files to the test database server.

The unfortunate side effect of this is that the SQL users in each database are no long associated with the SQL Login. SQL Server has had an sp called ‘sp_change_users_login‘ since SQL Server 2000. You can pass in the ‘Auto_Fix’ parameter and the name of the SQL user that is orphaned and SQL Server will do it’s best to automatically match the user name up with a SQL login. There are a few gotchas that the MSDN article, and people better versed in SQL Server internals than I, can spell out. But for 90% of the cases out there, this SP will work fine. But it only works on one user at a time. So I created the world simplest code generator to help me create all of the statements needed to attempt to reassociate my users with their logins.

select 'exec sp_change_users_login ''Auto_Fix'',''' + name + '''' from sysusers where uid > 2 and uid < 16000

uid 2 is the guest account and uids > 16000 are the built in "db_*" users.

Language performance doesn’t matter when a database is involved

All of the latest Rails/Twitter performance bruhaha made me think about some advice I got a long time ago and that I dish out whenever someone asks me about some performance concerns they have with their code.

Nothing else matters once you hit the disk. Once you do any kind of activity that involves reading/writing to a hard disk, that activity instantly becomes your greatest performance drag. No matter how slow any language is at interpreting/JITing/compiling, it’s still orders of magnitude faster than interacting with a hard disk with physical, moving parts.

Bad things to hear from your DBA

It’s not good to hear your DBA ‘guru’ answer “Good question” when he’s been working for 2 months creating insert and update DTS packages for your application and you ask him “How can you tell if you are supposed to insert or update a given record in the source table?”.

Especially when NOT being able to tell the difference would result in over 500K duplicate inserts every day. (especially when the reason you’ve asked him that question is because you noticed his DTS package would result in the 500K dupe inserts/day).


Size of scientific datasets

I deal mostly with medical data, some appointment data, but mostly data about a patients labs, medications, and what they are having taken out or put into them. These databases can get pretty big, right now I’ve got three main databases. One is about 1.5GB, the second is abou 2.3GB, and the third, which is a staging database for the other two, is about 4GB. These aren’t very big compared to the multi-terrabyte (10^12 or 2^40 bytes) databases you’ll see at Fortune 500 companies or at Google.

But then you start to look at the data sets that supercolliders turn out. These supercolliders can churn out 10 petabyte (10^15 or 2^50 bytes) datasets each time they run. Some astronomical data sets run into the 10PB/year range. None of these are in a RDBMS yet. None can handle that much data all at once yet. Datasets in this range fascinate me because I can’t really visualize a multi-TB dataset, let alone a single petabyte. I know that some of the researchers here have multiple 100+GB Firewire external hard drives chained together holding DNA and protein sequence information. Usually in hundreds and thousands of Excel spreadsheets(groan).

Check out this PowerPoint by Jim Gray of MS Research where he talks about some of the larger datasets. Interesting that Google was a 1.5PB as of spring 2001. WAY before the GMail initiative or Google Video upload. I wonder where they are now?