Mine Your Data for Business Insights
Take advantage of SQL Server 2005 and Visual Studio to mine your data for significant patterns that might otherwise escape your attention.
EM>Technology Toolbox: SQL Server
Data mining is about using automated or semi-automated techniques to analyze your data looking for hidden patterns, and then taking advantage of those patterns to plumb your data for valuable information. At its most useful, data mining enables you to discover facts about your data and/or business that you wouldn't otherwise see.
Data mining can help you spot issues that are creating bottlenecks in a given sales area, or highlight the significance of some piece of data relative to a given customer's likelihood to buy a product from your business. Optimally, data mining does more than provide a snapshot report of your current business climate; it also helps you make predictions about where your business is going. In this article, I'll introduce you to basic data mining concepts and principles, and give you suggestions for how to apply them to common business problems.
The concept of data mining has been with us for a while. Indeed, the assets needed to take advantage of data mining are already in place in many businesses, where large volumes of data have been accumulated and stored in company databases. Unfortunately, many organizations have become data-rich, but knowledge-poor. Data mining can help you correct this imbalance.
Data mining is as much art as science, but it is gaining adherents and prominence. For example, data mining is a key member in the Business Intelligence (BI) product family in SQL Server 2005. In addition to explaining basic data mining concepts, I'll walk you through the basics of using Microsoft Analysis Services, which ships with Microsoft SQL Server 2005. Keep in mind that even though data mining can give you a good way to analyze and extract valuable information from your data it won't help your organization if there isn't a process for exploiting the information mined and sharing it throughout your organization. A common data mining approach is to employ the Cross Industry Process for Data Mining (CRISP DM) methodology (see Table 1).
CRISP DM is a reference model that shows an overview of phases, concepts, tasks, and their underlying outputs to create a data mining project and manage its entire lifecycle. CRISP DM comprises six phases: business understanding, data understanding, data preparation, modeling, evaluation; and deployment and maintenance. The sequence of the phases is flexible, so you can move back and forth between different phases whenever it is required. Each phase describes how actions should be carried out in certain specific situations according to the conceptual architecture of a data mining project.
Getting Started With Data Mining
One key aspect of taking advantage of data mining is to recognize the kinds of fields where data mining can deliver significant benefits. For example, some fields where data mining applications can make a big difference include analysis of finance data (prediction of credit risks, discovery of fraudulent transactions, and prediction of stock prices), analysis of genetic databases (discovery of correlations between gene sequences and protein properties, and the prediction of the biological effect of chemical substances), analysis of text archives (classification, retrieval of news articles, scientific publications, discovery of risk factors for diseases from medical archives), search engines (information extraction, classification of documents, clarifying words, feedback relevance, structuring and visualization of documents), and user modeling (learning user profiles, personalized classification of documents, analysis of search behavior).
Utilizing data mining effectively requires an elementary understanding of some basic concepts. First, you must consider case, which is a basic entity of information. Case is characterized by attributes, such as age or full name, as well as an underlying state. Attributes are also called variables in statistical terms. An attribute can have values called state, and attributes can have one of four different model structure types: key, input, predictable, or both input and predictable. An attribute can also have two content types: continuous and discrete. Finally, some attributes contain only input variables, such as a clustering algorithm.
A case can correspond to a row in a table, but it's possible to find a case represented by more than one table row (a nested case). A case key consists of the attributes that identify a case uniquely. For example, the primary key of a table is used often as a case key, although sometimes the case key is composed of several primary keys, as happens with nested cases.
A data mining model is characterized by key columns, input columns, and predictable columns, as well as the data mining algorithm on which you train the model. Training a data model means finding patterns in the training dataset by using a specified algorithm and the underlying parameters for tuning that algorithm. The model also stores the discovered patterns. Model training is a process for discovering knowledge about the training dataset. You use model prediction to apply discovered and trained patterns to a new dataset for predicting potential values.
The data mining algorithm is the mechanism that creates mining models. To create a model, an algorithm first analyzes a set of data, looking for specific patterns. The algorithm then uses the results of this analysis to define the parameters of the mining model. The data mining algorithm depends on several structures, such as a set of rules, a decision tree, a mathematical model, and a set of clusters. You define data mining queries using Data Mining eXtensions (DMX).
Microsoft's Analysis Services provides six types of data mining algorithms—Classification, Clustering, Association, Regression, Sequence Analysis, and Deviation Analysis (see Table 2). Fortunately, the data mining approach employed by Analysis Services is rather simple to implement. All you need to do is to select the right data mining algorithm and specify the input columns and the predictable columns, which are the targets for the analysis.
Create a Sample Project
That's it for the basics of data mining. The next step is to create a simple data mining project. You have two options when working on a data mining project: immediate mode and offline mode. Working in immediate mode is generally a more natural experience for data mining users. You are connected directly and continually to an Analysis Services server when you work in immediate mode. For example, opening an object such as a mining structure means that you are opening the real object from the server. When you change the object and save it, the object is changed on the server immediately.
Getting started in immediate mode requires three steps. First, launch BI Dev Studio from All Programs | Microsoft SQL Server 2005 | SQL Server Business Intelligence Development Studio. Second, Choose Analysis Service Database from File | Open. Third, enter the name of the server and database in the Connect to the Database dialog, and then create the underlying solution.
When working in offline mode, your project contains files that are stored on your client machine. As you make modifications to objects in this environment, the changes are stored in XML format on your hard drive. The models and other objects aren't created on the server until you decide to deploy them to the destination server. This provides you, as the data mining developer/analyst, with the ability to design and test your models on a test server before deploying them to a live server.
This article's example relies on taking advantage of offline mode. Specifically, I'll show you how to create a data mining project that provides discovery information about your customers, the type of customer, and the customers' relationships to the territory where they live. You are going to create clusters that associate the hidden patterns while working in offline mode.
To work in offline mode, launch BI Dev Studio, and then choose All Programs | Microsoft SQL Server 2005 | SQL Server Business Intelligence Development Studio. Next, choose Analysis Services Project from File | New Project.
You want to create your data mining objects and define the data source you want to be analyzed, as well as the underlying structure and the data mining model. Begin by defining the data source object, which consists of the connection string to the data management system and additional information about the connection. Next, right-click on the Data Sources folder in the Solution Explorer and select New Data Source to launch the Data Source Wizard.
The wizard lets you choose your data source provider. Choose "Native OLE/DB, SQL Native Client" and then enter the name of the server, the authentication mode (Windows Authentication), and select the example database AdventureWorks provided by Microsoft SQL Server 2005. The next page of the wizard is for impersonation information for accessing the database; choose "Use the service account" option. Finally, select a descriptive name for this data source.
Create a View Object
The next step is to create a data source view object. This object lets you select, organize, and explore the data source. In this step, you choose the tables that contain the cases.
Right-click on the Data Source Views folder in the Solution Explorer and select New Data Source View to launch the Data Source View Wizard. In the first step, choose the data source created earlier. Click on the >> button on the Select Tables and Views Wizard page to move all tables from the Available Objects list to the Included Objects list. Next, select the Sales.Customer table you want to analyze and set a descriptive name for this object, such as Sales_Customer Views. You can right-click on the view object and select explore data to get information about this data.
You can also create a customized view of the data source. Right-click on the data source view, select "New Named Query," and create a query for selecting important attributes to be analyzed. In your case, select the CustomerId, TerritoryId, and CustomerType attributes from the table Sales.Customer. In the Create Named Query Wizard, set a descriptive name for the data view, such as Sales_Customer_Territory_CustomerType. This SQL query will suffice to get you started:
You're now ready to use the Data Mining Wizard to create the major objects in the data mining project, which consist of your data mining structure and the data mining models.
A data mining structure defines the domain of a mining problem; the mining model is the application of a mining algorithm to the structure. A mining structure is composed of a list of structure columns that have data and data types. It is bound to an associated data source, and includes additional information about how the data is modeled. The data mining model is the implementation of a data mining structure, as well as an algorithm that contains the underlying parameters. You must determine how each column of the structure will be used in the model, as well as the underlying algorithm-specific mapping. How you use your algorithms depends on the business problems and capabilities of your business intelligence server.
You can use the Data Mining Wizard to create the data mining structure and the data mining model objects. Right-click on the Mining Structure Folder in the Solution Explorer, and then choose New Mining Structure. On the first page, you choose whether you want to create a model from a relational or multidimensional source (an OLAP cube). In cases where mining algorithms don't support the creation of models from a cube, select "From existing relational database or data warehouse."
The next page asks you which algorithm to use to create your initial mining model. Choosing which algorithm you are going to use is dependant on the business problem you want to solve. For example, use the Microsoft Clustering algorithm for grouping cases that show evidence of similarity.
The Microsoft Clustering algorithm uses a k-means algorithm. The basic idea of this algorithm is to initialize k points randomly as the center of the k clusters, and then assign each case to the nearest central point according to a measure function, thus forming new clusters. Later, a mean point for each cluster is calculated, and this is the new central point for each cluster. Once again, the algorithm assigns each case to the nearest central point, and so on, until no new clusters are created. At this point, the algorithm has converged. Note that the convergence of the algorithm depends strongly on the randomly initialized central points.
Select the Proper Case
The next few pages of the wizard let you select the case table from some defined views, as well as whether the table is nested or not. Choose the named query Sales_Customer_Territory_CustomerType.
The next step is to specify how the mining algorithm interprets the attributes. Specifically, you need to dictate whether attributes are key, input, or predictable. You can click on the suggest button to perform a quick entropy-based analysis that tells you which columns are likely to provide useful results. (see Figure 1).
The next wizard page of interest is called "Specify Columns Content and Data Type." This page lets you select the respective data and content types for each attribute. It's important that you select the appropriate content type (discrete or continuous) for the performance and accuracy of the model. Clicking on the Detect button causes the wizard to sample and analyze the source data for choosing the appropriate content type. Describe the CustomerID and TerritoryID attributes content type as discrete. The wizard concludes after you set a descriptive name for the data mining structure and data mining model container.
At this point, you have a model you can use. The next step is to train the model by running it against the server. Data mining has the ability to process all the models in a structure in parallel on a single data read. Before processing a newly created structure or model, you must deploy the objects in the server. In the Build menu, select the option Deploy solution. The model is trained automatically in this step. You receive the "Deployment Completed Successfully" message when this step finishes.
Once you create and train a model, the next step is to interpret it. Each data mining algorithm is associated with a content viewer on the Mining Model Viewer tab. This viewer is specially designed to display patterns discovered with the underlying algorithm. I've walked you through aspects of using the cluster algorithm, so let's take a look at the results as seen through its viewer. Do this by navigating through each tab of the cluster viewer (see Figure 2).
The first tab, Cluster Diagram, shows the results uncovered by our cluster algorithm model. In this example, you can see that the model found seven clusters and the relationship between them. Each cluster has a different color intensity that shows the number of cases that belong to the former cluster.
You access the Cluster Profiles view from the second tab. This view contains a column for each cluster in your model and a row for each attribute. This setup makes it easy to see interesting differences across the cluster space. Using this view, you can choose an attribute of interest and visibly scan horizontally to see its distribution across all clusters. When an item catches your interest, you can look at neighboring cells or other cells of the same cluster to learn more about what that cluster means.
The Cluster Characteristics view enables you to compare clusters one-by-one. You can use this process to refine your view of a given cluster. The last tab is Cluster Discrimination. It consists of variables, values of the variables, and the complements from a selected cluster to other clusters.
You can rename a cluster for future use once you review and understand a given model. This is easy to do—simply right click on the cluster you want to rename and select the Rename Cluster action.
That's about it; you're now ready to apply the lessons of data mining to your own applications using SQL Server 2005 and Visual Studio. The power of data mining consists not in telling you what you already know, but suggesting relationships that you might never think of without the aid of automated analysis.