In-Depth

Build Big-Data Apps in SQL Azure with Federation

Get ready to scale out SQL Azure databases beyond today's 50GB limit with the Transact-SQL and ADO.NET elastic sharding features, which are coming in the 2011 SQL Azure Federation Community Technology Previews.

An urban myth that relational databases and SQL can't achieve Internet-scale of terabytes -- or petabytes -- has fostered a growing "NoSQL" developer community and a raft of new entity-attribute-value, also known as key value, data stores. The Microsoft SQL Azure team gave credence to the myth by limiting the maximum size of the initial Business Edition of the cloud database to just 10GB. SQL Azure adds a pair of secondary data replicas to assure high availability, and the team cited performance issues with replication as the early size-limiting factor.

Today, you can rent a 50GB SQL Azure database for $499.95 per month, but the SQL Azure Team isn't talking publicly about future scale-up options. Instead, Microsoft recommends that you scale out your SQL Azure databases by partitioning them horizontally into smaller instances, called shards, running on individual SQL Azure database instances, and group the shards into federations.

Sharding increases database capacity and query performance, because each added SQL Azure database brings its own memory and virtual CPU. Microsoft Software Architect Lee Novik first described SQL Azure sharding details in his "Building Scale-Out Applications with SQL Azure" session at the Microsoft Professional Developers Conference 2010, held last October in Redmond.

Horizontal partitioning isn't new to SQL Server. Horizontally partitioning SQL Server 7.0 and later tables to multiple files and filegroups improves performance by reducing average table and index size. Placing each filegroup on an individual disk drive speeds T-SQL queries. Partitioning also streamlines backup and maintenance operations by reducing their time-window length.

SQL Server 2005 automated the process with the CREATE PARTITION FUNCTION command, which lets you automatically map the rows of a table or index into specified partitions based on the values of a specified column. You design a CREATE PARTITION SCHEME to determine how to assign partitioned files to filegroups. SQL Server partitioned views make rows of all partitions appear as a single table; distributed partitioned views enable partitioning data across multiple linked servers, not just filegroups, for scaling out. SQL Server 2000 introduced updateable distributed views with distributed transactions, and SQL Server 2000 SP3 used OLE DB to optimize query-execution plans for distributed partitioned views.

A group of linked servers that participates in distributed partitioned views is called a "federation." The partitioning column, whose values determine the partition to which the row belongs, must be part of the primary key and can't be an identity, timestamp or default column.

New Taxonomy
Scaling out SQL Azure with federated database instances follows a pattern similar to that for on-premises SQL Server, but is subject to several important limitations. For example, SQL Azure doesn't support linked servers, CREATE PARTITION FUNCTION, CREATE PARTITION SCHEME, cross-database joins, distributed (cross-database) transactions, OLE DB or the T-SQL NewSequentialID function. These restrictions require architectural changes for SQL Azure federations, starting with this new taxonomy:

  • Federation consists of a collection of all SQL Azure database instances that contain partitioned data having the same schema. The T-SQL script in "How to Create a Federation with Customers, Orders and OrderItems Tables" shows a T-SQL script to create an Orders_Federation with a schema based on three tables of the Northwind sample database.
  • Federation Members comprise the collection of SQL Azure databases that contain related tables with partitioned data, called Federated Tables. A Federation Member also can contain replicated lookup tables (Products) that provide supplementary data that's not dependent on the Federation Key.
  • Federation Key is the primary key value (CustomerID) that determines how data is partitioned among Federated Tables, each of which must contain the Federation Key in their primary key, which can be a big integer (bigint, a 64-bit signed integer) or GUID (uniqueidentifier) data type. For example, the Orders and OrderItems tables have composite primary keys (OrderID + CustomerID and OrderID + ProductID + CustomerID, respectively).
  • Atomic Unit (AU)is a cluster of a single parent table (Customers) row and all related rows of its dependent tables (Orders and OrderItems). AU clusters can't be separated in the partitioning (sharding) process or when moving data between Federation Members.
  • Federation Root is the initial database that contains metadata for specifying the Partitioning (sharding) Method, range of valid values for the Federation Key, and minimum/maximum Federation Key value ranges for each Federation Member.
  • Partitioning Method determines whether the Federation Key is generated by the application or the data tier. For this article's example, the data tier uniqueidentifier data type provides random 128-bit (16-byte) GUID values, which balance additions across multiple Federation Members. Sequential bigint values are easier to read, but require a feature similar to the SQL Server Denali Sequence object to generate identity values that are unique over multiple Federated Tables.

The SQL Azure Team plans to release SQL Azure Federation features in piecemeal fashion starting with a Community Technology Preview (CTP) of version 1 in 2011. The current plan is for the CTP 1 to support partitioning by uniqueidentifier FederationKey values only; a post-CTP 1 drop will add bigint FederationKeys (see Table 1).

New T-SQL Key Words for Scripting SQL Azure Federation
The following new T-SQL syntax will support SQL Azure Federations in the version 1 CTP (see "How to Create a Federation with Customers, Orders and OrderItems Tables"):

CREATE FEDERATION Federation_Name(Federation_Key_Name RANGE uniqueidentifier)
DROP FEDERATION Federation_Name
CREATE TABLE Table_Name FEDERATE ON (Federation_Key_Name) 
USE FEDERATION Federation_Name (Fed_Key_Value) WITH FILTERING=ON
USE FEDERATION Federation_Name (Fed_Key_Value) WITH FILTERING=OFF
ALTER FEDERATION Federation_Name SPLIT AT(Fed_Key_Value)
ALTER FEDERATION Orders_Federation DROP AT(Fed_Key_Value)

The FILTERING=ON option restricts the visibility of AUs to the specific cluster specified by Fed_Key_Value; FILTERING=OFF makes all AUs in the Federation Member visible for bulk operations. The ALTER FEDERATION ... SPLIT AT Fed_Key_Value instruction lets you balance the size of Federation Members that have grown substantially larger than the average by moving AUs greater than Fed_Key_Value into a new Federation Member. SPLIT AT and DROP AT operations work with AUs exclusively.

The version 1 CTP ADO.NET sharding library will include sample Microsoft .NET Framework-based code for connecting to the Federation Root, which automatically routes the connection to the appropriate Federation Member based on information from the sys.federation_member_columns view, and retrieves AUs as ADO.NET DataSets. Here's the generic code for instructions that open a connection to a Federation Member whose member_id specified by dbname_postfix:

SqlConnection cn = new
SqlConnection("Server=tcp:servername.db.windows.net;" +
  "Db=SalesDB_" + dbname_postfix +
  ";User ID=username;Password=password;" +
  "Trusted_Connection=False;Encrypt=True");
cn.Open();
...

The code in Listing 1 automatically routes the connection to the appropriate Federation Member for a particular FederationKey value by connecting to the Federation Root database and executing a T-SQL USE FEDERATION FederationName(FederationKey) instruction (see "How to Get Data from the Appropriate Atomic Unit in a Federation Member").

At this point, you can issue parameterized select queries with CommandText such as SELECT * FROM Orders WHERE (CustomerId=@customerId). Note that SQL Azure supports USE FEDERATION but not USE DatabaseName instructions.

.NET projects based on CTP version 1 that require rows from more than one shard to populate a DataSet from a SELECT query will require client-side code that executes UNION ALL queries to aggregate the AUs. Obtaining aggregate values such as SUM(), AVG(), MIN(), MAX() and COUNT() with CTP version 1 also will require multiple queries that increment accumulator variables with values from individual Federation Members.

Automagical Online Partitioning and Other New Features
A primary SQL Azure selling point is minimizing -- and ultimately eliminating -- all common physical database management tasks by subscribers. The SQL Azure team wants to manage scaling federations out and in, as well as leveling the sizes of Federation Member databases.

According to Lev Novik and other SQL Azure team members, CTP versions later than version 1 will add at least the following features to SQL Azure Federations:

  • A new ALTER FEDERATION Federation_Name MERGE (Fed_Key_Value) instruction to enable repartitioning when scale-down is appropriate for reducing database subscription expense.
  • Auto repartitioning to enable SQL Azure to manage repartitioning the federated databases online with SPLIT and MERGE operations based on a policy you specify, such as Federated Table size or query execution time.
  • Fan-out queries so that a single query will be able to process results gathered across many Federation Members.
  • Schema management to allow multiversion schema deployment and management across Federation Members.
  • Multicolumn Federation Keys to enable federation on, for example, CustomerID+AccountID.

Java developers have had the Hibernate.Shards library for several years; it's now matured to version 3.0.0 beta 2 on the JBoss.org Web site. Developer Oren Eini, more commonly known by the alias "Ayende Rahien," rejuvenated the NHibernate Shard project for .NET last fall. SharpFellow blogger John Rayner demonstrated NHibernate Shards with SQL Azure and provided source code in his "Sharding into the Cloud" post. Rayner wrote in his post: "NHibernate.Shards is an extension to the well-known O/RM, which allows a logical database to be partitioned across multiple physical databases and servers. It's a port of the Hibernate.Shards project, as with lots of things in NHibernate. I thought it would be interesting to see how well it worked against SQL Azure. It turned out to be not interesting at all ... just plain easy!"

Microsoft has most of its object/relational mapping (O/RM) eggs in the Entity Framework basket. You can expect the teams that own data connectivity to SQL Server and SQL Azure -- including the ADO.NET Team that's responsible for Entity Framework -- to expand their repertoire to handle SQL Azure Federation. The Microsoft

Azure Marketplace DataMarket is a classic big-data application, so it's a good bet that Pablo Castro (a Microsoft software architect in the SQL Server Group) and his WCF Data Services team are hard at work updating the RESTful OData API. Synchronizing SQL Azure Federations across datacenter boundaries (geolocations) and with on-premises SQL Server databases will keep Liam Cavanagh (Microsoft senior program manager for SQL Azure Data Sync and Microsoft Sync Framework) and his team busy for at least a few months. Watch these groups' blogs closely for signs of progress on Federation-enabled connectivity for SQL Azure.

Don't believe everything you hear from the "NoSQL" crowd about the demise of SQL databases for big data. Microsoft intends to protect its database turf in the cloud from naysayers who claim it won't scale past 50GB. The company's approach is to automate the sharding process and its management for scaling out and back in to minimize human intervention. Get ready to take advantage of SQL Azure Federations by downloading the CTP 1 as soon as it's available, and then plan your route to high-scalability relational data nirvana.

Credits: Thanks to Cihan Biyikoglu, program manager for SQL Azure, for providing important technical insights about SQL Azure Federations and the forthcoming CTPs; visit his "SQL Azure -- Your Data in the Cloud" blog for up-to-date information on SQL Azure Federation developments. You'll find additional references to his posts, as well as other sharding approaches for SQL Azure, Hibernate and NHibernate, in the OakLeaf Systems Resource Links for SQL Azure Federations and Sharding Topics post.

comments powered by Disqus

Featured

  • AI for GitHub Collaboration? Maybe Not So Much

    No doubt GitHub Copilot has been a boon for developers, but AI might not be the best tool for collaboration, according to developers weighing in on a recent social media post from the GitHub team.

  • Visual Studio 2022 Getting VS Code 'Command Palette' Equivalent

    As any Visual Studio Code user knows, the editor's command palette is a powerful tool for getting things done quickly, without having to navigate through menus and dialogs. Now, we learn how an equivalent is coming for Microsoft's flagship Visual Studio IDE, invoked by the same familiar Ctrl+Shift+P keyboard shortcut.

  • .NET 9 Preview 3: 'I've Been Waiting 9 Years for This API!'

    Microsoft's third preview of .NET 9 sees a lot of minor tweaks and fixes with no earth-shaking new functionality, but little things can be important to individual developers.

  • Data Anomaly Detection Using a Neural Autoencoder with C#

    Dr. James McCaffrey of Microsoft Research tackles the process of examining a set of source data to find data items that are different in some way from the majority of the source items.

  • What's New for Python, Java in Visual Studio Code

    Microsoft announced March 2024 updates to its Python and Java extensions for Visual Studio Code, the open source-based, cross-platform code editor that has repeatedly been named the No. 1 tool in major development surveys.

Subscribe on YouTube