Q&A

Data Virtualization in SQL Server 2022

PolyBase, a data virtualization feature for SQL Server, allows users to seamlessly query data from various external sources directly using T-SQL (Transact-SQL) without the need for separate client connection software.

"A key use case for data virtualization with the PolyBase feature is to allow the data to stay in its original location and format," says Microsoft. "You can virtualize the external data through the SQL Server instance, so that it can be queried in place like any other table in SQL Server. This process minimizes the need for ETL processes for data movement. This data virtualization scenario is possible with the use of PolyBase connectors."

First introduced with SQL Server 2016, PolyBase has evolved over the years and with the latest version, SQL Server 2022, offers new features and enhancements, summarized in this table:

PolyBase in SQL Server 2022
[Click on image for larger view.] PolyBase in SQL Server 2022 (source: Microsoft).

To help developers and database administrators understand the mechanics of PolyBase and how to write T-SQL queries to access data from heterogeneous sources, Kevin Feasel, proprietor of Catallaxy Services, LLC, will present a session titled "Data Virtualization in SQL Server 2022" at the upcoming Visual Studio Live! developer conference taking place Aug. 5-9 at Microsoft Headquarters in Redmond, Wash.

"PolyBase is Microsoft's primary data virtualization technology, allowing you to integrate SQL Server with a variety of external systems such as Oracle, MongoDB, Cosmos DB, Teradata, Apache Spark, and even other SQL Server instances," Feasel said. "In this talk, we will walk through the mechanics of PolyBase, showing how we can write T-SQL queries to access data from heterogeneous sources. We will see how data virtualization in SQL Server has evolved over the years and the new capabilities you will find in SQL Server 2022. Finally, we will look at several interesting use cases for data virtualization in the modern data environment."

He promises attendees will learn:

  • What data virtualization is and its history in SQL Server
  • The three key objects for data virtualization: external data sources, file formats, and tables
  • How to connect SQL Server to a wide variety of data platform technologies

We caught up with the Microsoft Data Platform MVP, who specializes in "data analytics with T-SQL and R, forcing Spark clusters to do his bidding, fighting with Kafka, and pulling rabbits out of hats on demand," to learn more about his 75-minute, introductory/intermediate session.

VisualStudioMagazine: What inspired you to present a session on this topic?
Feasel: I have been a fan of PolyBase since its first mainstream release in SQL Server 2016. The idea of interacting with Hadoop (and Azure Blob Storage) using T-SQL statements was great. My inspiration for presenting a session on the topic is that I want to share some of this passion, as well as letting people know what has changed in SQL Server 2022 and how things work now.

Can you briefly explain the concept of data virtualization in SQL Server and its significance in modern data management?
The core of data virtualization is the idea that end users really don't care where the data lives. They want the ability to "connect the dots" regardless of whether the data is in SQL Server, PostgreSQL, MongoDB, or some other source.

"With PolyBase, SQL Server offers that capability, exposing data in systems as though they were simply local SQL Server tables and letting you use the same set of T-SQL functionality you already know."

Kevin Feasel, Proprietor, Catallaxy Services, LLC

With PolyBase, SQL Server offers that capability, exposing data in systems as though they were simply local SQL Server tables and letting you use the same set of T-SQL functionality you already know.

How has data virtualization evolved in SQL Server up to the 2022 version, and what are the key improvements?
In SQL Server 2016, we saw the first release of PolyBase in the mainline SQL Server product. This supported access to Hadoop and Azure Blob Storage. With SQL Server 2019, the focus changed to emphasize ODBC connections, opening the door to a variety of other data platforms, like Oracle, MongoDB, and Teradata, among many others. Now, with SQL Server 2022, we see an emphasis on APIs, supporting Azure Blob Storage, Azure Data Lake Storage Gen2, AWS S3, and more. And with a market shift away from classic Hadoop instances, Microsoft eliminated support for Hadoop and its requirement of installing Java on your SQL Server instance.

Inside the Session

What: Data Virtualization in SQL Server 2022

When: Aug. 8, 2024, 8 a.m. - 9:15 a.m.

Who: Kevin Feasel, Proprietor, Catallaxy Services, LLC

Why: Learn the mechanics of PolyBase and how to write T-SQL queries to access data from heterogeneous sources

Find out more about Visual Studio Live! taking place Aug. 5-9 at Microsoft Headquarters in Redmond, Wash.

Are there any limitations or challenges in implementing data virtualization in SQL Server 2022 that organizations should be aware of?
Absolutely. Any data virtualization technology runs the risk of being slow, especially when joining together data from disparate sources. There are some ways to mitigate this, as we talk about in the session, but the core trade-off in data virtualization platforms is risk of slow results or data latency, where results are out of date.

What are the prerequisites for someone looking to implement data virtualization in SQL Server 2022, and what level of expertise is required?
This is definitely something for database administrators to set up. PolyBase itself doesn't have particularly onerous requirements, but there are a few T-SQL constructs that are new to users, including CREATE EXTERNAL DATA SOURCE, CREATE EXTERNAL FILE FORMAT, and CREATE EXTERNAL TABLE. The exact syntax for how to set up an external data source will differ based on the data source, and it can be a challenge working through integration problems. For that reason, I'd recommend people with a bit more experience administering databases try this out, versus someone brand new to SQL Server.

For attendees looking to learn more about data virtualization after your talk, what resources or next steps would you recommend?
For a detailed look at PolyBase as of SQL Server 2019, I authored the book PolyBase Revealed. It was the first book-length analysis of PolyBase and its capabilities and holds up fairly well for people not yet on SQL Server 2022. For people using 2022, most chapters are still relevant, though not the ones covering Hadoop and Azure Blob Storage.

Note: Those wishing to attend the training seminar can save hundreds of dollars by registering early according to the event's pricing info. "Save $400 when you register by June 7," said the organizer of the event, which is being presented by the parent company of Visual Studio Magazine.

About the Author

David Ramel is an editor and writer at Converge 360.

comments powered by Disqus

Featured

Subscribe on YouTube