Code Focused

Full-Text Search in SQL 2012 with Custom Proximity

Custom proximity search, a new feature in SQL Server 2012 Full-Text Search, is facilitated using the NEAR operator.

Custom proximity search is a new feature introduced in SQL Server 2012 Full-Text Search. It's facilitated using the NEAR operator, allowing it to be used in the CONTAINS predicate or CONTAINSTABLE function. The NEAR operator allows for searching words near each other. It also allows the user to specify the maximum number of non-search terms separating the first and last search terms in a match. In addition, an optional parameter can be used to specify an exact order match.

In a previous article, I created a FileTable and setup a Full-Text Catalog on it. I'll use that example as the basis for this article, and go into a detailed discussion of the NEAR operator. Please note to execute the entire script from the FileTable article, including copying the three text files into the FileTable directory. This will be needed before progressing with the examples in this article.

Using the NEAR Operator
The NEAR operator can be used within the CONTAINS predicate of a WHERE clause or the CONTAINSTABLE function. It's used for exact or fuzzy searches of a search term in text. A search term can be either a single word or a phrase delimited by double quotation marks (i.e. "my special phrase").

Switching to the FileTable example mentioned previously, you may recall the contents of the AudiA6.txt file are:

Model Year: 2011
Mileage: 18,231
Body Style: Sedan
Engine: V6
Exterior Color: White
Interior Color: Black
Interior Material: Leather

Please note this example pertains to .TXT files which are handled by default in SQL Server. To search other file types, a filter pack may need to be installed. Click here for more information on installing filter packs.

Looking at the example below, the WHERE clause utilizes the CONTAINS predicate to search the column named file_stream for the words "V6" and "Leather". Both terms must exist in the column in order for a record to match. Assuming the demo steps mentioned in the FileTables article were executed correctly, the following query should return a record, for the file AudiA6.txt.

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'near(V6, Leather)')
Go

File_Name
AudiA6.txt

The NEAR operator can accept multiple search terms. However, all terms must exist in a table column for a match. If any of the terms don't exist, no match will be found. The order and case of the search parameters is irrelevant. The following statement searches for four terms and finds the same matching record as the previous example (for AudiA6.txt):

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'near(v6, BLACK, leather, color)')
Go

File_Name
AudiA6.txt

However, the following example won't work because the first search term ("sam") doesn't exist in combination with the other terms specified:

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'near(sam, v6, BLACK, leather, color)')
Go

When used with the CONTAINS predicate, the '~' can be substituted for the term NEAR in chaining search terms. This makes the following two statements yield the same results as the previous two examples:

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'v6 near leather near color')
Go

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'v6 ~ leather ~ color')
Go

Custom Proximity Term
The Custom Proximity Term is a new feature that allows the user to specify the maximum distance between search terms; it's an optional second parameter. The following example searches for the terms "Exterior" and "white", with a maximum distance of three words between them:

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, exterior), 3)');
Go

File_Name
AudiA6.txt

This returns the record for the AudiA6.txt file because it contains the two search terms, even though they don't occur in the order specified and the case of "WHITE" isn't an exact match.

The NEAR operator also allows an optional third parameter, of Boolean type, used to specify exact matches or not. If I utilize this parameter in the same example, it won't return a record because the terms don't occur in the order specified.

Select count(*) as Total_Count
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, exterior), 3, true)');
Go

Total_Count
0

The NEAR term can also be compounded using AND(&), OR(|), or AND NOT(&!) operations, as seen in the following example. Once again, the order of the terms is irrelevant because the optional Boolean parameter is excluded.

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, exterior), 3) AND NEAR ((interior, black), 3)');
Go

File_Name
AudiA6.txt

Maximum Distance
The NEAR operator allows a maximum of 64 search terms. For a match to occur, all search terms must occur within the specified maximum distance, starting from the first term to the last term.

The maximum distance also includes "Stopwords",  which are words that don't have a linguistic meaning such as "a," "and," "is," and "the", and so on. These words are excluded from the full-text index since they don't affect a search. For more information on Stopwords, please see the MSDN documentation.

If the search terms span different sentences, paragraphs or chapters, the distance used is increased by 8, 128, or 1024, respectively. Note the following example that searches for the colors "white" and "black". If you recall, these are the colors of the exterior and interior, respectively. The terms are separated by two words but located on separate lines, and considered as separate paragraphs. Because of this, the distance value had to be increased to 130 in order for a match to be found. Otherwise, a lesser value yields no match.

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, black), 130)');
Go

File_Name
AudiA6.txt

Please note the Custom Proximity Term is a replacement for the old Generic Proximity Term, which is still available but will be deprecated in upcoming versions of SQL Server. Go here http://bitly.com/1nJjQw for more information on the Custom Proximity Term.

More Powerful Searching
With the inclusion of the NEAR operator in SQL Server 2012, Full-Text searching with custom proximity can be implemented for fuzzy or exact matches. Detecting the proximity of search terms aids in gathering more meaningful information from data. Combined with FileTables and filter packs, Full-Text searching is now powerful than ever before.

About the Author

Sam Nasr has been a software developer since 1995, focusing mostly on Microsoft technologies. Having achieved multiple certifications from Microsoft (MCAD, MCTS(MOSS), and MCT), Sam develops, teaches, and tours the country to present various topics in .Net Framework. He is also actively involved with the Cleveland C#/VB.Net User Group, where he has been the group leader since 2003. In addition, he also started the Cleveland WPF Users Group in June 2009, and the Cleveland .Net Study Group in August 2009, and is the INETA mentor for Ohio. When not coding, Sam loves spending time with his family and friends or volunteering at his local church. He can be reached by email at [email protected].

comments powered by Disqus

Featured

  • Diving Deep into .NET MAUI

    Ever since someone figured out that fiddling bits results in source code, developers have sought one codebase for all types of apps on all platforms, with Microsoft's latest attempt to further that effort being .NET MAUI.

  • Copilot AI Boosts Abound in New VS Code v1.96

    Microsoft improved on its new "Copilot Edit" functionality in the latest release of Visual Studio Code, v1.96, its open-source based code editor that has become the most popular in the world according to many surveys.

  • AdaBoost Regression Using C#

    Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the AdaBoost.R2 algorithm for regression problems (where the goal is to predict a single numeric value). The implementation follows the original source research paper closely, so you can use it as a guide for customization for specific scenarios.

  • Versioning and Documenting ASP.NET Core Services

    Building an API with ASP.NET Core is only half the job. If your API is going to live more than one release cycle, you're going to need to version it. If you have other people building clients for it, you're going to need to document it.

  • TypeScript Tops New JetBrains 'Language Promise Index'

    In its latest annual developer ecosystem report, JetBrains introduced a new "Language Promise Index" topped by Microsoft's TypeScript programming language.

Subscribe on YouTube