Code Focused

Full-Text Search in SQL 2012 with Custom Proximity

Custom proximity search, a new feature in SQL Server 2012 Full-Text Search, is facilitated using the NEAR operator.

Custom proximity search is a new feature introduced in SQL Server 2012 Full-Text Search. It's facilitated using the NEAR operator, allowing it to be used in the CONTAINS predicate or CONTAINSTABLE function. The NEAR operator allows for searching words near each other. It also allows the user to specify the maximum number of non-search terms separating the first and last search terms in a match. In addition, an optional parameter can be used to specify an exact order match.

In a previous article, I created a FileTable and setup a Full-Text Catalog on it. I'll use that example as the basis for this article, and go into a detailed discussion of the NEAR operator. Please note to execute the entire script from the FileTable article, including copying the three text files into the FileTable directory. This will be needed before progressing with the examples in this article.

Using the NEAR Operator
The NEAR operator can be used within the CONTAINS predicate of a WHERE clause or the CONTAINSTABLE function. It's used for exact or fuzzy searches of a search term in text. A search term can be either a single word or a phrase delimited by double quotation marks (i.e. "my special phrase").

Switching to the FileTable example mentioned previously, you may recall the contents of the AudiA6.txt file are:

Model Year: 2011
Mileage: 18,231
Body Style: Sedan
Engine: V6
Exterior Color: White
Interior Color: Black
Interior Material: Leather

Please note this example pertains to .TXT files which are handled by default in SQL Server. To search other file types, a filter pack may need to be installed. Click here for more information on installing filter packs.

Looking at the example below, the WHERE clause utilizes the CONTAINS predicate to search the column named file_stream for the words "V6" and "Leather". Both terms must exist in the column in order for a record to match. Assuming the demo steps mentioned in the FileTables article were executed correctly, the following query should return a record, for the file AudiA6.txt.

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'near(V6, Leather)')
Go

File_Name
AudiA6.txt

The NEAR operator can accept multiple search terms. However, all terms must exist in a table column for a match. If any of the terms don't exist, no match will be found. The order and case of the search parameters is irrelevant. The following statement searches for four terms and finds the same matching record as the previous example (for AudiA6.txt):

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'near(v6, BLACK, leather, color)')
Go

File_Name
AudiA6.txt

However, the following example won't work because the first search term ("sam") doesn't exist in combination with the other terms specified:

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'near(sam, v6, BLACK, leather, color)')
Go

When used with the CONTAINS predicate, the '~' can be substituted for the term NEAR in chaining search terms. This makes the following two statements yield the same results as the previous two examples:

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'v6 near leather near color')
Go

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream, 'v6 ~ leather ~ color')
Go

Custom Proximity Term
The Custom Proximity Term is a new feature that allows the user to specify the maximum distance between search terms; it's an optional second parameter. The following example searches for the terms "Exterior" and "white", with a maximum distance of three words between them:

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, exterior), 3)');
Go

File_Name
AudiA6.txt

This returns the record for the AudiA6.txt file because it contains the two search terms, even though they don't occur in the order specified and the case of "WHITE" isn't an exact match.

The NEAR operator also allows an optional third parameter, of Boolean type, used to specify exact matches or not. If I utilize this parameter in the same example, it won't return a record because the terms don't occur in the order specified.

Select count(*) as Total_Count
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, exterior), 3, true)');
Go

Total_Count
0

The NEAR term can also be compounded using AND(&), OR(|), or AND NOT(&!) operations, as seen in the following example. Once again, the order of the terms is irrelevant because the optional Boolean parameter is excluded.

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, exterior), 3) AND NEAR ((interior, black), 3)');
Go

File_Name
AudiA6.txt

Maximum Distance
The NEAR operator allows a maximum of 64 search terms. For a match to occur, all search terms must occur within the specified maximum distance, starting from the first term to the last term.

The maximum distance also includes "Stopwords",  which are words that don't have a linguistic meaning such as "a," "and," "is," and "the", and so on. These words are excluded from the full-text index since they don't affect a search. For more information on Stopwords, please see the MSDN documentation.

If the search terms span different sentences, paragraphs or chapters, the distance used is increased by 8, 128, or 1024, respectively. Note the following example that searches for the colors "white" and "black". If you recall, these are the colors of the exterior and interior, respectively. The terms are separated by two words but located on separate lines, and considered as separate paragraphs. Because of this, the distance value had to be increased to 130 in order for a match to be found. Otherwise, a lesser value yields no match.

Select name as File_Name
from dbo.CarsDocStore DS
where contains(file_stream , 'NEAR((WHITE, black), 130)');
Go

File_Name
AudiA6.txt

Please note the Custom Proximity Term is a replacement for the old Generic Proximity Term, which is still available but will be deprecated in upcoming versions of SQL Server. Go here http://bitly.com/1nJjQw for more information on the Custom Proximity Term.

More Powerful Searching
With the inclusion of the NEAR operator in SQL Server 2012, Full-Text searching with custom proximity can be implemented for fuzzy or exact matches. Detecting the proximity of search terms aids in gathering more meaningful information from data. Combined with FileTables and filter packs, Full-Text searching is now powerful than ever before.

About the Author

Sam Nasr has been a software developer since 1995, focusing mostly on Microsoft technologies. Having achieved multiple certifications from Microsoft (MCAD, MCTS(MOSS), and MCT), Sam develops, teaches, and tours the country to present various topics in .Net Framework. He is also actively involved with the Cleveland C#/VB.Net User Group, where he has been the group leader since 2003. In addition, he also started the Cleveland WPF Users Group in June 2009, and the Cleveland .Net Study Group in August 2009, and is the INETA mentor for Ohio. When not coding, Sam loves spending time with his family and friends or volunteering at his local church. He can be reached by email at [email protected].

comments powered by Disqus

Featured

  • Hands On: New VS Code Insiders Build Creates Web Page from Image in Seconds

    New Vision support with GitHub Copilot in the latest Visual Studio Code Insiders build takes a user-supplied mockup image and creates a web page from it in seconds, handling all the HTML and CSS.

  • Naive Bayes Regression Using C#

    Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the naive Bayes regression technique, where the goal is to predict a single numeric value. Compared to other machine learning regression techniques, naive Bayes regression is usually less accurate, but is simple, easy to implement and customize, works on both large and small datasets, is highly interpretable, and doesn't require tuning any hyperparameters.

  • VS Code Copilot Previews New GPT-4o AI Code Completion Model

    The 4o upgrade includes additional training on more than 275,000 high-quality public repositories in over 30 popular programming languages, said Microsoft-owned GitHub, which created the original "AI pair programmer" years ago.

  • Microsoft's Rust Embrace Continues with Azure SDK Beta

    "Rust's strong type system and ownership model help prevent common programming errors such as null pointer dereferencing and buffer overflows, leading to more secure and stable code."

  • Xcode IDE from Microsoft Archrival Apple Gets Copilot AI

    Just after expanding the reach of its Copilot AI coding assistant to the open-source Eclipse IDE, Microsoft showcased how it's going even further, providing details about a preview version for the Xcode IDE from archrival Apple.

Subscribe on YouTube

Upcoming Training Events