Match by Exception
Create more flexible, data-driven solutions with the aid of a match by exception design pattern.
Technology Toolbox: VB 2005, SQL Server 2005, SQL Server Express Edition
An almost universal requirement faced by those who create data-oriented applications is the need to locate records that match a given set of criteria.
The criteria by which you match can take a variety of forms. For example, the criteria can be simple: "Find me a Customer with ID = 42." Sometimes the structure of the request can be complex: "Find me a business rule record in a table that is more specific than any other record, given multiple matching fields." A complex request doesn't require complex code by necessity. Nor does such code have to perform poorly. In fact, there's a general class of problems that you can solve efficiently using a design pattern that might best be described as matching by exception.
I'll show you how to implement such a design pattern, using a billing system as an example. I want to stress that this technique is by no means limited to accounting. If you encounter a situation where it becomes clear that you have some conditions that drive the selection of data—whether these conditions revolve around rates, rules, templates, or something else—and you're finding maddening exceptions to these selection rules, this approach can keep you from wanting to pull out your hair. There are many creative ways to apply matching by exception as part of the broader principle of data-driven design, and I've seen it used to simplify greatly what might otherwise have been an ugly solution. For example, I remember a situation where goods would move between a set of locations that were represented as multiple records containing "from" and "to"—effectively specifying a route. But these routes differed depending on the service, the customer type, and various other criteria. It was dealing with the exceptions that was tedious, but I was able to offload this aspect of the problem to business analysts because the solution was data-driven.
The billing system described in this article assumes you work for a theoretical company that invoices dozens of different customers. Suppose you discover that rates are set based on customers' location, type, and some attributes of the services being invoiced. Maintaining a Cartesian product of all possible combinations would be unmanageable in all but the most trivial cases, even though the SQL code to locate these records would be simple and fast. Another approach might be to build multiple tables to account for different levels of matching. Unfortunately, this can lead to complex, conditional processing logic at the expense of efficiency and maintainability.
A better approach is to leave the concept of a rate table as flexible as possible. This enables you to accommodate changes in rates themselves, as well as accommodate changes in how you set the rates. Under this approach, you focus on building an engine where rate records can be selected based on either specific or general needs. You then codify the set of parameters by which you define the records, not the parameter values. The end product is a table with much fewer records. Note that such a table might still be fairly wide; that is, it can have a large number of columns. You can use this approach to model underlying business truth closely. You simply model the conditions that are meaningful to the business. In some cases, a business will dream up new conditions when it has the kind of flexibility this approach offers.
In addition to explaining how to implement a matching by exception design pattern, I'll also cover the practical ramifications of this solution using a VB.NET test harness to illustrate different options for using the engine. For example, you'll see how the test harness is designed to operate against both saved and unsaved source data.
Build a Data Model
Your first step is to create a data model for the sample application. This data model contains different types of entities. One type consists of instances of services performed for customers. This is the CustomerService table. The data model includes code tables such as Customer, CustomerType, ServiceType, ContainerSize, WeightClass, Location, and SubLocation. Each table provides descriptions and other rich metadata. The latter two tables have a hierarchical relationship, as well. Finally, the Rate table serves as your source of matching, ideally producing exactly one RateAmount value for every invoice line (see Figure 1).
Your ultimate goal for the sample app is to be able to ask: "What's the proper rate amount, given a service provided for a customer?" You want to be able support both general and specific queries. For example, assume you're concerned only with customers (see Figure 2). Bob is a customer, so the rate table might look something like this (see Table 1).
Logically, you need to find a way to select the Rate with RateID = 2 when dealing with Bob and RateID = 1 for everyone else. You can extend this to include a second criterion, such as Location (see Figure 2). Bob gets a steep discount in this example. The Rate table might look something like this in this case (see Table 2).
You can also address the "ALL OTHER" value by using NULL because no value has a specific meaning to us in this context: "do not match on a customer" or "do not match on a location" (see Figure 3).
You can keep extending this data model conceptually with your filter fields to arrive at the final model. If necessary, you can also use Boolean flags and lists of candidate matching values, in addition to foreign keys. You can model lists using many-to-many relationships. For example, you might create a relationship between Customer and a CustomerList, with CustomerListID appearing on the Rate table.
You can achieve the goal of selecting the "best" Rate record in a number of ways, but one of the easiest approaches is to turn the selection into a two-part process. The first step is to select all candidate records. The second step is to select the most specific record, out of the first step's subset of records. You can also use characteristic functions or CASE statements to count criteria that are matched for each candidate record, keeping the process set-based. This enables you to leverage SQL Server for what it's great at: set-based processing. It also helps you limit the number of moving parts required.
A characteristic function is a pre-existing SQL function that can convert the true or false values of a Boolean expression into either 1 or 0. You can exploit this fact to perform conditional logic, count, or even pivot and group.
This was a common way of doing CASE-like logic before the CASE statement was introduced to SQL Server. However, there is some limited relevance for characteristic functions today. I say limited because one must consider whether or not there is an alternative way of expressing it (using a construct such as CASE) and based on the setting, what the real-world performance implications are.
Count the Facets
A good characteristic function to use in this article's example is COALESCE(SIGN(columnName), 0). This function evaluates to "1" for anything but NULL when applied to columns such as WeightClassID and CustomerTypeID in the Rate table, and "0" for NULL. You want to apply it to all columns that are potentially going to be NULL (representing "All Others") because the presence of a non-NULL value effectively defines something as being more specific than something where the field is NULL.
When you sum this value across all criteria columns, you count the number of matching columns or facets that a given candidate Rate record has. At this point, you're a good portion of the way to picking the record with the maximum facet count, which leads to the most specific match. Of course, this function relies on built-in assumptions, such as the fact that you use numeric columns that can contain NULL or values greater than zero.
Here's how you might express these conditions as a CASE statement:
CASE WHEN columnName IS NULL
THEN 0 ELSE 1 END
I wanted to test the potential for performance differences between characteristic functions and CASE statements, so I set up a loop that ran each of the approaches a million times each. Both tests gave results of 23 seconds, so I chose to use the most readable alternative: the CASE statement.
Adopting the CASE approach in the article's sample code enables you to take advantage of working with both integer-based foreign keys, as well as columns of type uniqueidentifier. If dealing with integers, you can use -1 for the NULL case that represents all others. In this case, you need to use similar join logic, checking for -1 instead of NULL. However, you also lose your ability to have foreign key constraints in this case without adding a bogus record for "All Others" to parent tables—so it's not my favored approach.
Another design requirement that would benefit the application: the ability to feed back a rate amount, not just on saved CustomerService records, but for services that are pending in memory inside a client application. This leads you to parameterize your request based on the attributes of the service instead of the CustomerService record ID.
An important premise that you can rely on is that some fields on Rate are mandatory. In other words, you don't allow them to be "All Others." You should adopt this approach as much as possible because it makes the join logic more efficient by avoiding the more costly OR operators. For the sample, assume that ServiceTypeID and an EffectiveDate are always available. You also need to consider what base condition you're matching with. In this case, you're getting rates for service types and for a given cycle. All other attributes are effectively filters.
Next, you need to consider the Rate table's data. It's possible to set up the table's data so you arrive at two or more records with the same number of facets for the input conditions, blowing the general premise of this application out of the water. Unfortunately, accounting for this is not as simple as creating a trigger that can reject data that causes duplicates.
That said, it's possible to have records with the same facet counts for a given base condition as long as it isn't possible to select them all for a given set of inputs. This issue points to another requirement: You need to detect duplicates at run time and report them. You can typically resolve duplicate rates by adding more specific records, as needed. Sometimes this might mean providing a comprehensive list of combinations on a small range of filters.
Implement the T-SQL Code
You implement the exception matching design pattern as a stored procedure. The complete implementation isn't particularly complex or code-intensive (see Listing 1).
Let's consider the earlier test case of looking for Bob's rate in NorCal. The first DML statement is the INSERT. It populates a temporary, in-memory table for all candidate records (see Table 3).
The final SELECT uses the fact that the last row has the greatest facet count; it joins back to Rate on RateID = 4 to retrieve the RateAmount.
In terms of the matching process itself, you need to provide values for the @CustomerID, @ServiceTypeID and @ServiceDate parameters. NULL values for the other input parameters match only Rate records that contain corresponding NULL values.
Table 3 illustrates why you'd be in trouble if you didn't have a specific rate for Bob in NorCal. In that case, you'd have two records with a facet count of "1"—so how would you choose one? You can deal with duplicate and missing rate records by returning matching rates as a result set that you can use to determine the record count on the client.
You can take several steps to ensure this example performs well. First, you want to ensure that the SQL optimizer eliminates as many records as quickly as possible with your matching query, in the INSERT step. Picking a clustered index makes a clear difference in the query plan. For example, this covering index gives you good results:
CREATE CLUSTERED INDEX [IDX_Rate] ON
Another thing to consider when indexing a table like Rate: In most business situations, this table is read often, but seldom updated. This generally opens the door to more aggressive indexing.
The sample application includes an option to run up_GetRateByCriteria 1,000 times. In my single-box development environment, I average one to two milliseconds per call with 25,000 sample Rate records. On a slow network, the time for this will go up, so proper application partitioning is an important consideration for application architects.
The execution plan relies on clustered index seeks, with the primary seek based on Rate's ServiceTypeID. This is not unreasonable because you make the ServiceType part of your base condition. If you had more required matching fields, you could position these near the top of the index, helping to limit intermediate results more effectively. In the final SELECT step, the query performs a clustered index scan; however, this query performs the scan on the TABLE variable representing only candidate records, which should be limited in number.
Another issue that plays into performance of this implementation is the use of ORs to detect the "All Other" conditions. It would be especially noticeable if you were to allow nullable input parameters to be treated as meaning "any." In this case, additional OR logic would be required and the compiled plan for this function would not be optimal for all combinations of inputs. In the sample implementation, however, the matching is fixed against the Rate table, so the real performance bottleneck is the clustered index seek on Rate.
If you determine you need to perform only bulk matching, then the code in Listing 1 might not be the best performer. For example, assume you want to find rates for many CustomerService records. In this case, turning the stored procedure into a scalar user-defined function wouldn't necessarily help because that would mean iterating the entire matching process for all input rows. In such a case, you can refactor the logic contained in the function to join against multiple records, making it set-based and highly effective.
An example of this refactored approach is implemented in the test harness application in the stored procedure up_RetrieveServices. Ultimately, you can assess the specific situation you face to decide which approach is best, balancing reuse versus performance. For example, you might wonder why the stored procedure up_RetreiveServices relies on temp tables, whereas up_GetRateByCriteria uses a TABLE variable. The reason is simple: These are the approaches that performed best in practice.
Create a VB.Net Test Harness
I've constructed a single screen that allows a user to do one of two things. First, it lets the user load the grid with all existing CustomerService records, pulling over the rate amounts during the retrieve. Or, the user can enter a billable item manually, which pulls back a rate amount when the user navigates between fields.
The first case employs the bulk loading approach, which adapts the code of Listing 1. The second case responds to grid events to keep the Rate column in sync with the current attributes of the row.
The downloadable ZIP file for the example includes a complete database that provides some sample data. The database is implemented using a file-based connection string and assumes that you have SQL Express installed. If you want to run it under a SQL Server2005 instance, run the included ChangeConnection.exe application and it will attempt to not only attach the .MDF file to the selected SQL instance, but also update the connection string in the source code.
Another thing to be aware of is that the application forces the tempdb database be a minimum size of 100Mb (50Mb data, 50Mb log). This is critical to achieving good performance "out of the box" because the defaults of 2Mb for data and 1Mb for log are far too small to support the bulk-matching process when dealing with large sets. If you didn't do this, the first retrieve would be extremely slow although subsequent retrieves would perform well.
Be sure to try entering a new row while working with the running application. This will either cause a valid rate amount to appear in the Rate column, or the row error icon to appear with a tool-tip message providing error details. The status bar also provides rough timings.
This article looks only at handling rates and billing, but you can extend the concept of matching by exception to many other situations. For example, I've seen this design pattern used to select routing rules, create "instance templates" (for determining what records to create, their attributes, and so on), and to facilitate table-driven security schemes.
An important caveat: This process is powerful, but it can make visualization and maintenance challenging. Knowing what a Venn diagram is probably doesn't hurt, but to use this approach most effectively, the people with strong business domain knowledge need to be in a position to construct or help construct the underlying data. One of the key benefits of the pattern is that non-developers can handle on-going changes in various rules. Developers only need to involve themselves when the app requires structural changes, such as new filter criteria. Even in this case, you can localize the changes in some well-defined spots.
Admittedly, this approach is overkill for simple or static situations. It's easy to imagine a proliferation of tables with only a few records that never change. What this framework does offer is a high level of configurability with limited performance implications, if designed with some care for a given business scenario. You can use it easily from both T-SQL and .NET code, and you can abstract it enough to consider it a general design pattern. I've seen normally sedate business analysts become quite excited when a developer explains this approach's possibilities to them.