Real-World Data Mining
Build a real-world data mining app that enables you to examine historical customer data and make predictions about new customers by comparing the two.
Technology Toolbox: SQL Server, SQL Server Business Intelligence Development Studio, Microsoft SQL Server Management Studio 2005
Data mining and visualization are among the most significant aspects of SQL Server 2005 -- and among the least well understood. The goals are easy enough to understand: You want to achieve better business efficiency by seeing your data in new ways. But implementing the technology to take advantage of these features often leaves developers at a loss, as much for conceptual reasons as technical ones. In a recent article, I explained some of the core concepts of data mining: what it's for, how you take advantage of it, and how to implement it in your own applications [Database Design, "Mine Your Data for Business Insights," VSM January 2007]. That article walked you through the general approach of setting up and exploiting data visualization models with Microsoft Analysis Services 2005.
While useful and broadly applicable to many different circumstances, that approach generated several questions from readers who wanted to know how to take a specific, real-world project and apply the principles of data mining and visualization to it. Your wish is my command.
Let's analyze a business scenario. Assume a retailer wants to expand its operations and needs to analyze the historical shopping behavior of its customers to serve them better. For example, the retailer might want to determine the preferred shipping methods under a given set of conditions, and see whether it can glean insights into serving the bulk of its customers better. Specifically, the retailer might look at customer buying patterns in the context of the cities its customers live in and the customers' contact titles. By analyzing this data, the retailer hopes to find better ways of interacting with all its customers, as well as to discover and recommend the most efficient ways to distribute its products and services.
This business problem centers on the clustering data mining task (also known as segmentation). The idea is to create natural groupings of cases based on a set of attributes, so that entities within a group share similar attributes (see Figure 1). The algorithm that you'll use to solve this task is Microsoft Clustering.
Before I walk you through the specific steps required to create the server and client applications, let's take a couple of minutes to go over the key data mining technologies and where and how they are used. This context is important for getting the most out of the solution I'll describe in this article.
Data mining requires three basic steps (see Figure 2). The first step is to define a mining model: the input attributes, the predictable attributes, and the associated algorithms. The next step is to train (process) the models and store the discovered patterns. Finally, you analyze data for future predictions and assign the prediction results to new data.
Real-World Requirements for DM
OLE DB for Data Mining (OLE DB for DM) specifies a common interface for accessing DM engines and is supported by several vendors. Its goal is to provide an industry standard for data mining; it comprises a set of specifications, several prediction functions, and a set of schema rowsets (metadata) for discovering mining models and their underlying structures. This technology relies on the Data Mining Extensions (DMX) query language, which you can use for model creation, training, and prediction.
To use OLE DB for DM, you create a connection object to access the mining engines, define and execute a prediction query using a command object, and get the model results represented by a result set. Several data mining vendors have developed their own OLE DB for DM providers; you can access these products through OLE DB, ADO, and ADO.NET.
Another key data mining technology -- XML for Analysis (XMLA) -- provides a way for DM engines to expose their functionality, mining models, and metadata through XML messages and standards-based Web protocols such as SOAP and HTTP. XMLA includes two types of requests: Discover and Execute. You use a Discover request to get metadata from the server, such as the available DM algorithms and mining models, along with their parameters and metadata. You use the second kind of request, Execute, to send command messages to the server. Command messages include (but aren't limited to) creating, training, and querying a model.
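For illustration, here's a sketch of what a Discover request might look like on the wire. The DMSCHEMA_MINING_MODELS rowset lists the mining models available on the server; the catalog name is an assumption for this example:

```xml
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
      <!-- ask the server to enumerate its mining models -->
      <RequestType>DMSCHEMA_MINING_MODELS</RequestType>
      <Restrictions />
      <Properties>
        <PropertyList>
          <!-- hypothetical catalog name -->
          <Catalog>CustomerSegmentation</Catalog>
        </PropertyList>
      </Properties>
    </Discover>
  </soap:Body>
</soap:Envelope>
```

The server answers with a rowset, serialized as XML, describing each model; an Execute request follows the same envelope pattern but carries a command (such as a DMX statement) instead of a RequestType.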
XMLA also relies on the DMX query language, and it leverages OLE DB design concepts, which lets developers protect their existing technologies and tools: XMLA is flexible, allows universal data access, and eliminates the need to deploy client components such as COM or .NET Framework interfaces. XMLA is also the native protocol for Microsoft SQL Server 2005 Analysis Services (SSAS) when client applications communicate with instances of Analysis Services.
The main goals of DMX are to leverage the SQL language and map relational concepts to the data mining approach, making this information accessible to any database developer. For example, a mining model might be mapped to relational tables, and several relational operations might be mapped to mining tasks. DMX is similar in structure to SQL and comprises Data Definition Language (DDL) statements, Data Manipulation Language (DML) statements, and functions and operators for discovering metadata in the Analysis Services engine.
You use DDL to create a mining model and its underlying structure (CREATE MINING STRUCTURE, CREATE MINING MODEL), delete existing mining models and structures (DROP MINING STRUCTURE, DROP MINING MODEL), export and import mining structures (EXPORT, IMPORT), and copy data from one mining model to another (SELECT INTO). You use DML to train mining models (INSERT INTO), browse data in mining models (SELECT FROM), and make predictions using a mining model (SELECT ... FROM PREDICTION JOIN).
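To see how these statements fit together, here's a compact sketch of the full DMX lifecycle. The model, column, and table names here are hypothetical, and the connection string is an assumption you would adjust for your server:

```sql
-- 1. Define a model (DDL); names are illustrative
create mining model SampleModel
(
    Id text key,
    Attribute1 text discrete,
    Target text discrete predict
)
using Microsoft_Clustering

-- 2. Train the model from a relational query (DML)
insert into SampleModel (Id, Attribute1, Target)
OpenRowset('SQLOLEDB', 'Server=localhost;Trusted_Connection=yes',
    'select Id, Attribute1, Target from TrainingTable')

-- 3. Predict against a new case (DML)
select Predict(Target) from SampleModel
natural prediction join
(select 'some value' as Attribute1) as t
```

This is exactly the define-train-predict sequence described earlier, and it's the pattern the rest of the article follows with the Northwind data.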
Develop Your App Structure
That's enough background; let's get to the fun part: designing your data mining app and creating the code that does the real work in your project. This article assumes you're using SQL Server Business Intelligence Development Studio (BI Development Studio), which includes several key wizards for creating data mining apps. It also assumes you're using SQL Server Management Studio to create the DMX queries, because this article focuses on the programmatic point of view.
Implementing this real-world data mining solution requires creating separate server and client applications. The server application creates and trains mining models, while the client application (a Windows Forms or Web Forms application) consumes the server application and makes predictions based on the mining model.
This example relies on the Northwind database, which you import into Microsoft SQL Server 2005, as well as the tables Customers and Orders, which represent the underlying business entities. The main steps you need to carry out on the server side include creating a data source and a data source view, creating a data mining model, training the data mining model, and creating the prediction queries.
Creating a data source and a data source view is straightforward. Begin by defining a data source to the Northwind database, and then create an associated data source view for these tables and a named query, OrdersByCustomers, using the BI Development Studio:
select Customers.CustomerID, Customers.City,
	Customers.ContactTitle, Orders.ShipVia
from Customers inner join
	Orders on Customers.CustomerID = Orders.CustomerID
Now you need to deploy these DM objects to the Analysis Services engine. Go to the Build menu and select Deploy Solution. Next, launch SQL Server Management Studio and connect to the Analysis Services engine to execute some DMX statements. The first step is to create a mining model:
create mining model OrdersShipViaByCustomersDModel
(
	CustomerID text key,
	City text discrete,
	ContactTitle text discrete,
	ShipVia text discrete predict
)
using Microsoft_Clustering
Once you have a mining model, you need to train it. The training statement has the same syntax as the relational Insert statement; the only difference is that the target object is a mining model. The OpenRowset statement lets you query data from any data source through an OLE DB provider. It accepts two sets of parameters. The first set specifies an OLE DB provider, the associated connection string, and a database object or SQL query:

OpenRowset('provider_name', 'provider_string', 'catalog.schema.object' | 'query')
The second set of parameters specifies the OLEDB provider, data source, user id, and password:
OpenRowset('provider_name', 'data_source'; 'user_id'; 'password', 'query')
Executing the OpenRowset statement successfully requires that you set a couple of properties. Set the server property DataMining\AllowAdHocOpenRowsetQueries to true and DataMining\AllowedProvidersInOpenRowset to SQLOLEDB and Microsoft.Jet.OLEDB. You can find these properties by right-clicking on the server registration name in the Object Explorer and selecting Properties from the context menu. Then select the Show Advanced (All) Properties checkbox in the Analysis Server Properties.
This DMX statement trains the mining model from an external data source:
insert into OrdersShipViaByCustomersDModel
	(CustomerId, City, ContactTitle, ShipVia)
OpenRowset('SQLOLEDB',
	-- adjust the connection string for your server
	'Server=localhost;Trusted_Connection=yes',
	'select Customers.CustomerID, Customers.City,
		Customers.ContactTitle, Orders.ShipVia
	from Northwind.dbo.Customers as Customers
	inner join Northwind.dbo.Orders as Orders on
		Customers.CustomerID = Orders.CustomerID')
At this point, you have a trained data mining model and have identified clusters, or segments, of customers. Ideally, the cases in each group should have a similar profile with respect to specific characteristics (City, ContactTitle, ShipVia), while the profiles of cases belonging to different clusters should be highly dissimilar. The goal of customer profiling is to guess what a customer will do in the future, given that you might have little data about any single customer, such as City and ContactTitle. The clustering technique allows you to borrow data from similar customers by assigning a new customer to a cluster and indicating the likelihood that the customer will behave in a particular way. This can give you insight into your customers' needs, enabling you to make recommendations or special offers based on a customer's attributes.
For example, you know the city where the customer lives and his contact title, and you would like to guess at the customer's likely preferred shipping method, whether it's DHL, UPS, or Federal Express. Predicting this information would enable you to serve the customer better, or even to make a suggestion that works better for that particular customer. The neat part about this approach is that it is highly adaptable. For example, you could just as easily apply this approach to create an e-business recommendation system, following the model of many of today's online stores. Gathering just a little information from your customer enables you to make a much broader set of predictions about the customer and what he or she might like.
Interpret the Results
The next step is to interpret the created clusters. Begin by locating your data mining model in the Object Explorer in SQL Server Management Studio. Next, right-click on the model and select Browse from the context menu to open the Mining Model Viewer window.
The first tab is Cluster Diagram, which lets you see all your clusters connected by lines. Each cluster is positioned closer to other similar clusters and the level of darkness in the connecting lines expresses the degree of similarity. The shading of each cluster shows how many customers the sample population contains; the darker the cluster, the more customers it holds.
Now let's look more closely at what the mining model tells you about your customers. The second tab in the Mining Model Viewer is Cluster Profiles, which lets you compare the composition of the clusters (see Figure 3). The discovered clusters appear in the columns and each attribute in the rows, along with the most representative states for each attribute. For example, London, Rio de Janeiro, Sao Paulo, and Boise are the most representative cities in the sample population, and Sales Representative, Owner, Sales Manager, and Accounting Manager are the most representative contact titles.
The next step is to create prediction queries for new customers. Prediction queries apply the patterns discovered in your historical data to new cases, and they represent the end goal of a business-intelligence solution. There are two types of prediction queries: batch and singleton. A batch query takes multiple input cases, which you can store in a table. A singleton query takes a single input case, which you build in real time.
The prediction statement uses the same syntax as the relational Select statement. In this case, you want to predict possible clusters for each new customer in a data set stored in an external Access database. Once you determine a customer's possible cluster, you can use that information to predict his future behavior because customers inside a common cluster typically behave similarly:
select t.*, OrdersShipViaByCustomersDModel.*,
	$Cluster as [Cluster], ClusterProbability() as [Probability]
from OrdersShipViaByCustomersDModel
prediction join
	OpenRowset('Microsoft.Jet.OLEDB.4.0',
		-- the database path and table name are illustrative
		'C:\NewCustomers.mdb'; 'admin'; '',
		'select CustomerID, City, ContactTitle from NewCustomers') as t
on OrdersShipViaByCustomersDModel.CustomerID = t.CustomerID
and OrdersShipViaByCustomersDModel.City = t.City
and OrdersShipViaByCustomersDModel.ContactTitle = t.ContactTitle
Your goal with this application is to determine the most appropriate cluster for a new customer and indicate the underlying likelihood that they belong to a given cluster based on the information gathered about the customer. In this case, assume you know the city where the customer lives and his contact title, and you would like to recommend a good shipping method. To do this you need to write a singleton query against your mining model.
The purpose of this query is to predict the cluster for a customer from New York who works as a sales manager, along with the likelihood that the customer belongs to that cluster:
select t.*, OrdersShipViaByCustomersDModel.*, $Cluster as
	[Cluster], ClusterProbability() as [Probability]
from OrdersShipViaByCustomersDModel
natural prediction join
(select 'New York' as City, 'Sales Manager' as
	[ContactTitle]) as t
That's It for the Server-Side Code
Now you need to develop the client-side components that will interact with the server-side components.
The client-side portion of this system is a Windows Forms application that consumes the data mining models. The application gathers key information about new customers, such as City and Contact Title. Once you have this information, you send a DMX query command to the Microsoft SQL Server 2005 Analysis Services server, which makes a cluster assignment prediction. The results of this query are displayed in the client-side interface.
This client-side portion relies on Adomd.NET, the Microsoft .NET Framework data provider designed specifically to communicate with SSAS. It uses the XML for Analysis (XMLA) protocol to access analytical data sources, using either TCP/IP or HTTP as the transport protocol. This API supports different query languages, including Multidimensional Expressions (MDX), DMX, and Analysis Services Scripting Language (ASSL). The object model in Adomd.NET extends key interfaces from the System.Data namespace, such as IDbConnection, IDbCommand, and IDataReader. The object model also supports the features of an analytical solution, and you can work with metadata either by retrieving an OLE DB-compliant schema rowset or by using the ADO.NET object model. This library is installed as part of the SQL Server 2005 Connectivity Components, or it can be downloaded as an MSI package from Microsoft's Web site.
The Adomd.NET components reside in the Microsoft.AnalysisServices.AdomdClient namespace, which is hosted in the Microsoft.AnalysisServices.AdomdClient.dll assembly.
The main steps for creating the client app include creating a Windows application as the front end, creating the business object responsible for sending a parameterized singleton query to the Analysis Services engine, and running the application. The first step is to create the Windows application. Launch Visual Studio and create a new project using the Windows Application template. Name the project CustomerSegmentationClientWin.
Next, you need to design the client's user interface. Begin by adding three System.Windows.Forms.TextBox control instances, two for gathering the City and Contact Title information of the new customer, and a third for displaying the prediction results. Then add two System.Windows.Forms.Label control instances to show descriptive messages. Finally, add a System.Windows.Forms.Button control instance to send the query command to the server (see Figure 1 for layout guidance).
Creating the logic for the business object that sends the parameterized query is also easy. The sample client follows the good design practices prescribed by the Model-View-Controller (MVC) and Enterprise Applications design patterns. The business layer, where the model of the application resides, comprises just one component: PredictCustomerMDXRequester, which implements the IPredictCustomerMDXRequester interface and is responsible for supplying a parameterized DMX query command and formatting the result that is returned (see Listing 1).
You begin by creating a connection to the server through an instance of the Microsoft.AnalysisServices.AdomdClient.AdomdConnection class (objConnection). Next, create a command through an instance of the AdomdCommand class to send the DMX query to the Analysis Services server. In this case, the query predicts which cluster a new customer would be assigned to and the likelihood, expressed as a probability, that the customer belongs to that cluster. You could create a dynamic query by concatenating the string representations of the values entered by the user, but this approach is dangerous, leaving room for DMX injection attacks and development errors. The better approach is to create a parameterized query and pass the user input as parameter values. Adomd.NET supports parameterized DMX queries:
select t.*, OrdersShipViaByCustomersDModel.*,
	$Cluster as [Cluster], ClusterProbability() as [Probability]
from OrdersShipViaByCustomersDModel
natural prediction join
(select @City as City,
	@ContactTitle as [ContactTitle]) as t
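As a sketch of how the client supplies values for @City and @ContactTitle with Adomd.NET, the fragment below creates the connection and command and attaches the parameters. The connection string, the strDMXQuery variable, and the literal values are illustrative assumptions:

```csharp
using Microsoft.AnalysisServices.AdomdClient;

// Hypothetical connection string; adjust for your server and catalog.
AdomdConnection objConnection = new AdomdConnection(
    "Data Source=localhost;Catalog=CustomerSegmentation");
objConnection.Open();

AdomdCommand objCommand = objConnection.CreateCommand();
objCommand.CommandText = strDMXQuery;  // the parameterized query above

// Parameter names match the @City and @ContactTitle placeholders.
objCommand.Parameters.Add(new AdomdParameter("City", "New York"));
objCommand.Parameters.Add(
    new AdomdParameter("ContactTitle", "Sales Manager"));
```

Because the values travel as parameters rather than concatenated text, the user's input can never alter the structure of the DMX statement itself.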
Executing this query gives you an instance of Microsoft.AnalysisServices.AdomdClient.AdomdDataReader:
AdomdDataReader objReader = null;
objReader = objCommand.ExecuteReader();
You can browse the result metadata by calling the method GetName(), which returns the name of each column:
for (int i = 0; i < objReader.FieldCount; i++)
{
	// append each column name (strResult is a hypothetical accumulator)
	strResult += objReader.GetName(i) + "\t";
}
Next, get the value for each column of the result query by calling the method GetValue():
for (int i = 0; i < objReader.FieldCount; i++)
{
	// append each column value (strResult is a hypothetical accumulator)
	strResult += objReader.GetValue(i) + "\t";
}
Now, let's create the components of the presentation layer using the Model-View-Controller (MVC) pattern. You design the View components first. The Controller components, which process and respond to events from the View components and interact with the Model components, are created automatically by Visual Studio using the code-behind approach.
You generate an event handler for the Click event of the System.Windows.Forms.Button object in your user interface by double-clicking on the button in the Form Designer. The completed handler looks like this:
private void button1_Click(object sender, EventArgs e)
{
	// Connection string and the Predict method are illustrative names
	string strASConnString =
		"Data Source=localhost;Catalog=CustomerSegmentation";
	IPredictCustomerMDXRequester objRequester =
		new PredictCustomerMDXRequester(strASConnString);
	this.m_tbResult.Text = "";
	try { this.m_tbResult.Text = objRequester.Predict(
		this.m_tbCity.Text, this.m_tbContactTitle.Text); }
	catch (Exception ex) { MessageBox.Show(ex.Message); }
}
The business logic in the button's Click event handler is straightforward. You call an instance of the PredictCustomerMDXRequester class from the business layer and send a message to predict the customer's future behavior based on the City and Contact Title information gathered from the user interface through the m_tbCity and m_tbContactTitle objects. You display the result in the m_tbResult object.
When you run the application for a customer who lives in New York and is an Owner, you find that the customer behaves like the other entities in cluster 4, with a likelihood of 0.4240, or 42.4% (a high value, given that the probability is spread across multiple clusters). (You can find more information about correlation at http://en.wikipedia.org/wiki/Correlation.) This customer isn't identical to the other members of the cluster, but from a global point of view he should exhibit similar behavior. So, you can recommend that this new customer use the shipping method identified by the Id 1. The effect of this choice is admittedly subtle in this case, but your customer will likely be pleased that you guessed his preferred delivery method without even asking him. Serving your customers better -- even when they don't realize you're doing so -- is one of the major benefits of using the techniques described in this article.
Obviously, what matters most is your ability to adapt this example to your own situation. The sample described in this article should provide a good template for creating new data mining models and making predictions that serve your own interests (don't forget the code download, available at http://code.visualstudiomagazine.com/2vs0704jo.zip). Feel free to pick apart the sample application for inspiration.