SQL or No SQL for Big Data in the Cloud?
The relational model and SQL dominate today's database landscape. But the Web's "big data" revolution is forcing architects and programmers to consider newer, unfamiliar models when designing and implementing Web-scale applications. Before diving into comparing data models, it's a good idea to agree on a definition of "big data." Adam Jacobs, a senior software engineer at 1010data Inc., describes it as "data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time" in his article, "The Pathologies of Big Data," in the July 2009 issue of ACM Queue. Jacobs explains:
In the early 1980s, it was a dataset that was so large that a robotic "tape monkey" was required to swap thousands of tapes in and out. In the 1990s, perhaps, it was any data that transcended the bounds of Microsoft Excel and a desktop PC, requiring serious software on Unix workstations to analyze. Nowadays, it may mean data that is too large to be placed in a relational database and analyzed with the help of a desktop statistics/visualization package -- data, perhaps, whose analysis requires massively parallel software running on tens, hundreds, or even thousands of servers.
The performance of SQL queries against relational tables decreases as the number of rows increases, leading to a requirement for partitioning, usually by a process called sharding. In the case of SQL Azure, a cloud-based database based on SQL Server 2008 R2, the maximum database size is 50GB, so individual partitions (shards) of 50GB or less in size are linked by a process called federation.
Not Only SQL
Some developers are turning to NoSQL, said to stand for "not SQL" or "not only SQL," to model data for Web-scale applications. NoSQL includes non-relational databases in the following categories:
Key/Value, also called Entity/Attribute/Value (EAV), stores are schema-less collections of entities that don't need the same properties. Windows Azure Table Storage and Amazon Web Services SimpleDB are proprietary, cloud-based examples, while BerkeleyDB, ReDis and MemcacheDB are open source implementations.
Column, also called column-family or wide-column, stores define columns in a configuration file and hold column families in rows accessed by a key value. Most column stores are modeled on Google BigTable architecture. Popular open source column-family implementations are Cassandra and Hypertable. Steve Marx, a Microsoft technical strategist on Windows Azure, has a live demonstration of Cassandra running in a Windows Azure project.
Graph stores, or graph databases, are an emerging NoSQL category based on graph theory that use nodes (standalone objects or entities) and edges (lines used to connect nodes and properties) to represent and store information. GraphDB is an open source graph store written in C# that can run in a Windows Azure Worker Role. Microsoft Research offers academic and commercial licenses for community technology previews of its Dryad distributed graph store and DryadLINQ parallel query language, both of which are destined for deployment in Windows Azure. Neo4j is a popular open source graph database for Java.
The head of the Microsoft Cloud Programmability Team, Erik Meijer, whom I call the "father of LINQ," and Gavin Bierman, a senior researcher at Microsoft Research, Cambridge, propose to drain the gulf between the relational and NoSQL data models. In the March 2011 ACM Queue article, "A Co-Relational Model of Data for Large Shared Data Banks", the coauthors apply category theory to prove that "the NoSQL category is the dual of the SQL category -- NoSQL is really coSQL." They assert:
The implication of this duality is that coSQL and SQL are not in conflict, like good and evil. Instead they are two opposites that coexist in harmony and can transmute into each other like yin and yang. Interestingly, in Chinese philosophy yin symbolizes open and hence corresponds to the open world of coSQL, and yang symbolizes closed and hence corresponds to the closed world of SQL.
Meijer and Bierman conclude, "Because of the common query language [LINQ] based on monads, both can be implemented using the same principles."