Practical .NET

Performance Tips: Speed up Data Access

Faster performance in business applications comes down to reducing the "critical two" -- disk I/O and trips to the server. Doing that means doing a lot of things right, starting with your database design -- and tweaking your ADO.NET code helps, too.

I once cut the average response time for an application from more than four seconds to less than two. In another case, I reduced the runtime of a batch application from seven hours to 45 minutes. In a third case, I eliminated a problem that caused an application to run slower and slower as the day went on until, finally, it would crash at about 2 p.m. In all of those cases, I did the same thing: I sped up data access.

Data access slows business applications down. Don't bother with programming changes that save milliseconds every thousand times through a loop. That's like speeding in the city -- you just get to the next red light (the next database access) as fast as possible. Speeding up your application comes down to reducing the "critical two" activities in business applications: trips to the database engine and the number of disk I/O operations when you get there.

If you're reading this article for some ADO.NET programming tips, you're going to be disappointed (though, by the end of this article, I'll be talking about what you can do in ADO.NET). Speeding up data access begins with your database design. If you get the database design right, everything is easy. If you get the database design wrong -- make sure you're paid by the hour.

Speeding up database access comes down to what happens as you read sectors from a disk drive. If you need rows A and B and they're both in the same sector, the second retrieval goes much faster. So, to quote that great philosopher of database design, O. Newton-John: "Let's get physical."

Database Design
An important part of relational database design is applying (at least) the first three levels of normalization. The second and third levels of normalization (ensuring that all columns in a row are functionally dependent on the primary key and not each other) are great for performance because they ensure that many updates require changing only one row. Fewer rows equal better performance. But those levels actually make things worse when retrieving data, because they increase the number of tables that must be joined during the retrieval operation. As the number of rows to retrieve increases, more joins equal much worse performance.

So, you have a trade-off: Normalizing to reduce disk I/O in updates (good) will increase joins in retrieval (bad). You're going to see this kind of trade-off when optimizing: some scenarios get better and others get worse. You're often going to find yourself deciding if you want your arm cut off or ripped off.

The solution here is to have two databases: An "operational" database to handle online, day-to-day, interactive transactions and a reporting database. This "separation of concerns" can often result in better response times in the operational database just by shunting reporting (which has high disk I/O) to another server.

Take the operational database to the highest level of normalization that you're comfortable with (at least to the third level). Joins don't matter much in this database because many transactions require rows from just one table. Even when joins are required, typically only a small number of rows are being joined.

Take the reporting database only to the first level of normalization (removing repeated fields). Leaving tables "unnormalized" reduces joins -- a good thing, because reporting usually involves hundreds or even thousands of rows. Reporting tables are typically updated in a highly controlled fashion (often just inserts, performed at night in a batch), so you don't have to worry about the update anomalies that the higher levels of normalization protect you from.

Indexes
Developers often regard indexes as the solution to all their data-retrieval problems. However, many indexes don't speed up retrieval and do slow down your updates. The problem is that, whenever you update an indexed column, you force an update to one or more "index tables" -- meaning more disk I/O. Because indexes are highly organized, making updates in them is often more expensive than the update to the table.

You shouldn't be shy about applying indexes to tables that don't see many updates -- such as those in your reporting databases. Your operational tables, however, typically see lots of updates, so you should justify every index on an operational table -- take them all off and run typical operational queries to see if the index is actually speeding up anything. You may be surprised at how few indexes are actually making a difference. You may find that only the indexes that your database engine won't let you delete (the ones on primary and foreign keys) are giving you any benefit.

For instance, if using an index would still force the database engine to read more than some minimum percentage of the table, the engine won't use the index at all. Rather than ratcheting back and forth across the surface of your hard disk picking up rows from random sectors, it's faster for the engine to drop the drive head at the top of the table and let the disk swing all of the table's sectors past it. An index on gender, which typically divides a table's rows into two groups that each make up about 50 percent of the table, probably isn't doing you any good.

In addition, an index only makes sense if the table is large. Most database engines will load small tables into memory. Once in memory, searching a table will go quickly, no matter what you do or don't do. What "small" means will depend on the number of rows, the size of each row and how much memory your database server has available.

Which brings up another point worth mentioning: If you want an easy solution for slow applications, consider throwing more memory at your database server or your disk-management subsystem. Both will automatically store frequently accessed data in memory (reducing disk I/O), provided you've given them enough memory to hold that data.

Considering Engine Optimizations
If you haven't spent some time spelunking through the documentation of your database engine, now would be a good time to start. Every engine has some options on it that can improve performance, at least under some circumstances. But, remember, every change speeds up some operations and slows down others. Remember that application whose average response time I cut in half? That speed gain was due primarily to applying a clustered index. But remember the application that kept slowing down until it crashed every afternoon? That was also due to a clustered index.

Clustering on an index causes the rows in a table to be physically stored on the disk in the order of the index. When a group of "related" rows is processed together, storing those rows next to each other speeds up retrieving them. If you're lucky, all the related rows will be in the same sector; at worst, the rows will be in logically adjacent sectors.

As you can only cluster on one index, you need to pick it wisely. When you retrieve a sales order in your operational database, for instance, you typically retrieve just one sales order -- clustering on the SalesOrderId won't get you much. On the other hand, if your application typically retrieves all the orders entered on a specific date, a clustered index on the DateEntered column would speed up that operation.

But clustering on an index increases disk I/O during updates. When you insert a new row without clustering, the row can, effectively, be inserted in the "open space" at the end of the table. Clustering on DateEntered won't change this because, presumably, each new sales order has a later DateEntered than the previous one: New rows will still be added at the end of the table.

However, if you cluster on, for instance, the CustomerId column, inserting a sales order for customer "B123" will require inserting a row between existing orders for customers "A123" and "C123." This is going to mean extra disk I/O: Some shuffling of data to new sectors is going to be required to make room for the "B123" row. Clustering also turns most updates on the clustered column into inserts because the row has to be relocated to the correct position for its new value. DateEntered might still be a good choice for clustering because it's a column that doesn't see many updates. Setting the freespace percentage when implementing your table can ensure that most sectors have empty space, which will reduce I/O on inserts and updates (that was the change I made for the application that crashed every afternoon).

You need to look at how your data is used. I worked for one company that, on average, added five new sales orders a day. Each of our sales orders had a ton of detail rows associated with it, so it was important to be able to retrieve all of the details quickly. We frequently retrieved orders by CustomerId (our customers placed dozens of orders at a time that would all be processed together). We clustered the sales order table on CustomerId to speed retrievals by CustomerId and clustered the details table on SalesOrderId because we typically retrieved all the details for a sales order. Inserts to the sales order table took longer but -- at five a day -- who cared? And we never updated the CustomerId.

This kind of thinking should be applied to all of the engine optimizations available to you.

Reducing Trips
Up until now, my suggestions have focused on reducing disk I/O. This section is about reducing the trips.

Everyone recommends using stored procedures because the SQL statements in a stored procedure are compiled at design time. Submitting raw SQL statements moves that compilation cost to runtime, slowing down your application. But, frankly, for most of the SQL statements used against your operational database, the time to compile the statement is so small you can't measure it. If you want to get any performance gain from using stored procedures, you must be willing to move some of your application logic to the stored procedure so that calling the procedure reduces the number of trips to the database.
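
As a sketch of what that can look like (the ProcessOrder procedure, connectionString and custId are hypothetical placeholders), a single stored procedure call that validates and saves an order replaces several separate statements -- and several separate trips:

'One trip: the procedure validates, inserts and updates inside the database
Using conn As New SqlConnection(connectionString)
  Using cmd As New SqlCommand("ProcessOrder", conn)
    cmd.CommandType = CommandType.StoredProcedure
    cmd.Parameters.AddWithValue("@CustomerId", custId)
    cmd.Parameters.AddWithValue("@DateEntered", DateTime.Now)
    conn.Open()
    cmd.ExecuteNonQuery()
  End Using
End Using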

A more flexible strategy than using stored procedures is retrieving more data than you need and hanging on to it -- also known as caching. Caching assumes that data doesn't need to be retrieved every time it's needed. Instead, when you retrieve data, you store it somewhere on the client and, rather than going back to the database the next time the user asks for something, you can display data you retrieved earlier. The assumption is that your data isn't particularly volatile -- that it will be an unusual case where someone changes the data that your user has requested.

Instead of handling volatility by retrieving the latest copy of the data, handle volatility by checking during updates to see if someone's changed a row since you retrieved it. Don't lock records -- it slows everything down.

The easiest way to check whether data has changed is to incorporate the row's TimeStamp into the Where clause of the Update statements you use in your ExecuteNonQuery methods (and, if you're really worried about volatility, in your Deletes as well). If the TimeStamp has changed, the Where clause won't find the record and the rows-affected count returned by ExecuteNonQuery will be zero. In that case, you apologize to the user, present the latest data and let the user reenter their changes.
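
Here's a minimal sketch of that check (the table, column and variable names are made up, and it assumes an open connection named conn and that you saved the row's original timestamp value when you first retrieved the row):

'Optimistic concurrency: the Where clause includes the timestamp value
'read when the row was originally retrieved
Dim sql As String = "Update SalesOrders Set Status = @Status " & _
                    "Where SalesOrderId = @Id And RowTimestamp = @OriginalTimestamp"
Dim cmd As New SqlCommand(sql, conn)
cmd.Parameters.AddWithValue("@Status", newStatus)
cmd.Parameters.AddWithValue("@Id", orderId)
cmd.Parameters.AddWithValue("@OriginalTimestamp", originalTimestamp)

If cmd.ExecuteNonQuery() = 0 Then
  'Someone else changed (or deleted) the row: show the latest data
  'and let the user reenter their changes
End If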

If you're working with ASP.NET, the Cache object supports this process well. If you're not working in ASP.NET, you have the DataSet object and generic collections. With generic collections, you create your objects as you retrieve your data and add those objects to a collection that stores objects by key, typically a Dictionary(Of TKey, TValue). Using the DataSet tactic, you retrieve data into the DataSet and create your client-side objects from rows in the DataSet. If you're using a DataSet, use a typed DataSet -- it speeds up pouring data into the DataSet through the DataAdapter's Fill method and reduces the cost of data conversions when accessing fields.

Regardless of which tactic you follow, before retrieving any data from the database, you check to see if the data is already in the DataSet or collection. When you do retrieve data, always add to the existing rows in the DataSet.
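
As a sketch of the collection-based version of that check (the Customer class and the GetCustomerFromDatabase method are hypothetical stand-ins for your own types and data-access code):

'Client-side cache: only go to the database on a cache miss
Private customerCache As New Dictionary(Of String, Customer)

Public Function GetCustomer(custId As String) As Customer
  Dim cust As Customer = Nothing
  If Not customerCache.TryGetValue(custId, cust) Then
    cust = GetCustomerFromDatabase(custId)
    customerCache.Add(custId, cust)
  End If
  Return cust
End Function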

With caching, consider retrieving more data than the user asks for. While there's a cost to retrieving additional data, it's far less than the cost of another trip to the database. If you can figure out what data your users typically access and get all of that data on their initial query, you can substantially speed up your application. Cutting that batch application's runtime by more than six hours was the result of applying this strategy: I stopped retrieving data the process never used and extended the existing SQL statements so that they retrieved the data the process typically did use.

Caching helps everyone: By reducing trips to the database server you also decrease the number of users accessing the engine at any one time. The result is faster performance for every user. For Web applications, store your retrieved data in the cache so that data retrieved by one user can be used by others.

Also, in your UI, encourage your users to do batch updates -- let users make lots of changes and then commit them as a batch to the database. A DataSet-based approach simplifies this, as you can transfer changes from the UI into the DataSet. When your user finally clicks the Update button, you can scan the DataSet for changes and concatenate the necessary SQL statements into a single ExecuteNonQuery call, as in the sketch below.
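
Here's a minimal sketch of that batching step (it assumes the individual SQL statements for the changed rows have already been built from the DataSet into a list called pendingSqlStatements, and that your database engine accepts multiple semicolon-separated statements in one command):

'Send all the changes in a single trip to the database
Dim batch As New System.Text.StringBuilder()
For Each stmt As String In pendingSqlStatements
  batch.Append(stmt)
  batch.Append(";")
Next

Using cmd As New SqlCommand(batch.ToString(), conn)
  cmd.ExecuteNonQuery()
End Using

In production code you'd want those individual statements parameterized rather than assembled through string concatenation, but the principle -- one trip instead of many -- is the same.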

Connection Pooling and Connection Strings
Creating an ADO.NET Connection object takes very little time -- creating a connection at the database, however, is terrifically time-consuming. Fortunately, .NET, Windows and your database engine conspire to give you connection pooling: If you open an ADO.NET Connection object and there's already a compatible connection in the pool at the database, that connection is handed to you.

Capitalizing on this comes down to keeping as many compatible connections in the pool as possible. Fortunately, this is easy to achieve: Keep your Connection objects open for the least amount of time. You should open your connections just before you issue your SQL statements and close them immediately afterwards. Creating your Connection object in a using block will ensure that your objects are closed at the end of the block:

Using conn As New SqlConnection(...connection string...)
  '...use Command and DataReader objects here...
End Using

Passing the CommandBehavior.CloseConnection option to the ExecuteReader method ensures that your connection is closed as soon as you call the DataReader's Close method:

rdr = cmd.ExecuteReader(Data.CommandBehavior.CloseConnection) 

Compatibility between your Connection object and any existing connection in the pool at the database is determined by your connection strings. Effectively, the Connection object's Open method does a character-by-character comparison of the connection string it's using with the connection strings of any existing connections at the database. If the Open method doesn't find a matching connection, a new connection at the database is created -- and this is bad.

It's critical to have as few different connection strings in your application as possible, in order to increase the likelihood that the connection string your Open method is using matches the string for a connection in the pool. One of the best reasons for keeping all your connection strings in your config files is that it makes it easy to see how many different connection strings you have and reduce them to some minimum number -- ideally, just one.
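
For example, with a single connection string stored in your config file (the "MainDb" name here is just an example), every Open call ends up using an identical string -- and, so, a pooled connection:

'Read the one shared connection string from app.config or web.config
'(requires a reference to System.Configuration)
Dim connectionString As String = ConfigurationManager.ConnectionStrings("MainDb").ConnectionString

Using conn As New SqlConnection(connectionString)
  conn.Open()
  '...issue your Command objects against conn...
End Using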

ADO.NET
Finally, you can improve ADO.NET performance by changing your code. Remember, however, what I said at the start of the article about improving performance: The key is to reduce the critical two activities of disk I/O and trips to the server. The tips in this section don't reduce either of those. You will get faster processing -- but I won't guarantee that you'll actually be able to measure it.

Improving ADO.NET performance comes down to sharing information. The more you tell ADO.NET about what you're doing, the faster it will perform. For instance, if you know that your query will return only a single row (probably because your SQL statement has a Where clause that uses the table's primary key), tell ADO.NET by passing the CommandBehavior.SingleRow option to the ExecuteReader method to reduce the amount of data that's retrieved and buffered:

cmd.CommandText = "Select * from Customers Where CompanyID = 'ALFKI';"
rdr = cmd.ExecuteReader(Data.CommandBehavior.SingleRow)

Similarly, if you know that your query returns one column from one row, use ExecuteScalar.
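
For example (the Count query is just an illustration, reusing the cmd object from the previous snippet):

'ExecuteScalar returns the first column of the first row as an Object
cmd.CommandText = "Select Count(*) From Customers"
Dim customerCount As Integer = Convert.ToInt32(cmd.ExecuteScalar())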

When retrieving data from a DataReader, the slowest access method is to use just the column name and convert the result. This code, for instance, loops through the rows in a DataReader, accessing the TotalSales column by passing the column name to the DataReader. The code then uses TryParse to convert the TotalSales value to a Decimal:

Dim totSales As Decimal
Do While rdr.Read
  If Decimal.TryParse(rdr("TotalSales").ToString(), totSales) Then
    '...use totSales...
  End If
Loop

Passing just the column name to the DataReader forces it to scan the columns in the DataReader to find the right one. It returns the result as an Object, which you then have to convert (and handle the possibility of an error).

Again, giving ADO.NET more information will get you faster performance: Tell ADO.NET the datatype of the column you're accessing and which column you want by position. The datatype-dedicated methods on the DataReader will handle the data retrieval and conversion more efficiently than the non-datatype-specific code I showed before. Unfortunately, to use those methods, you have to pass the position of the column rather than the column name. This example, for instance, assumes that the TotalSales column is the third column in the row:

totSales = rdr.GetDecimal(2)

If you're using SQL Server, then you're even better off with its dedicated version:

totSales = rdr.GetSqlDecimal(2)

Of course, this code is less readable than the version with the column name. One solution is to declare a constant to hold the column's position and use it when retrieving data:

Const TotSalesPosition As Integer = 2
totSales = rdr.GetSqlDecimal(TotSalesPosition)

But this code is brittle: Rearranging the order of the columns in the Select statement will change the position of the columns in the row (or, if you're using * in your Select clause, rearranging the order of the columns in the table will cause the same failure). You can use GetOrdinal to get the column position:

Dim TotSalesPosition As Integer
TotSalesPosition = rdr.GetOrdinal("TotalSales")

totSales = rdr.GetSqlDecimal(TotSalesPosition)

However, now you're back to scanning the columns in the DataReader to find your column, which doesn't make a lot of sense if you're only processing a single row. Using GetOrdinal to find the column position can make sense, however, if you do it before a loop and use the resulting variable inside the loop:

Dim TotSalesPosition As Integer
TotSalesPosition = rdr.GetOrdinal("TotalSales")
Do While rdr.Read
  totSales = rdr.GetSqlDecimal(TotSalesPosition)
  '...use totSales...
Loop

Sometimes, just creating the appearance of faster response times is sufficient. By switching to the asynchronous BeginExecuteNonQuery/EndExecuteNonQuery and BeginExecuteReader/EndExecuteReader methods, you can do other things for your user while waiting for database accesses to complete. While doing asynchronous reads will probably require a significant rewrite of your program, the asynchronous ExecuteNonQuery can be easier to implement and makes it appear as though your updates are completing earlier.
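
As a minimal sketch of the asynchronous ExecuteNonQuery pattern (note that, before the .NET Framework 4.5, these Begin/End methods also require "Asynchronous Processing=true" in the connection string when you're using SQL Server):

'Start the update and return control to the user; the callback runs
'when the database finishes the work
Private Sub StartUpdate(cmd As SqlCommand)
  cmd.BeginExecuteNonQuery(AddressOf UpdateCompleted, cmd)
End Sub

Private Sub UpdateCompleted(result As IAsyncResult)
  Dim cmd As SqlCommand = CType(result.AsyncState, SqlCommand)
  Dim rowsAffected As Integer = cmd.EndExecuteNonQuery(result)
  '...let the user know the update finished...
End Sub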

If you do all of these things, I can't guarantee that your application will run fast enough. I can guarantee, however, that your application will be running as fast as possible.
