Redmond Review
Will Big Data Be Big with Developers?
.NET developers are database developers. Whether using ADO.NET, the Entity Framework or data binding, .NET devs work with transactional data as a matter of course. But data analytics work is another matter. In fact, very few application and enterprise developers do analytics. Can Microsoft change that?
Microsoft opened the big data world to its ecosystem about a year ago with the announcement of its "Project Isotope" Hadoop on Windows initiative. A year later, though still in preview form, the technology has a brand (HDInsight) and significant integration with .NET and Visual Studio, and is clearly strategic to Microsoft. And developers are in the crosshairs: HDInsight was featured at BUILD, the flagship Microsoft developer conference.
Why does Microsoft think developers will take to analytics with big data now, when they didn't do so with business intelligence (BI) before? And given the overwhelming orientation of the big data world to Linux and Java, how does Microsoft expect to succeed in the space with Windows and .NET? At first glance, this looks to be a fool's errand. Is Microsoft naïve and tone deaf, or is it on to something?
Unboxing HDInsight
Before we judge whether developers will flock to HDInsight or shun it, let's get a sense of what the product is and what developer tools it features. HDInsight is based on the open source Apache Hadoop project, which provides processing and analysis of huge data sets (up to petabyte scale) by distributing the storage and compute workloads across numerous servers in a cluster. While this may sound straightforward -- and similar in principle to products such as SQL Server Parallel Data Warehouse (PDW) -- Hadoop can be pretty hard to work with.
Hadoop is natively queried through imperative Java code, using a two-pass approach called MapReduce. In this framework, a Map function first preprocesses the data, and a Reduce function then aggregates it. Multiple Mappers run in parallel across various nodes in the cluster, passing their output to multiple Reducer nodes to finish the work, also in parallel. A component included in most Hadoop distributions (including HDInsight) called "Pig" provides a data transformation language abstraction layer over Java-based MapReduce code. "Hive," another such component, provides a SQL-like abstraction over it.
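To make the two-pass idea concrete, here is a minimal single-process sketch of a word count written in the MapReduce style. It is plain Python, not Hadoop or HDInsight code: the names map_words, reduce_counts and run_job are hypothetical, and the in-memory grouping step only stands in for the work a real cluster does when it routes Mapper output to Reducers.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: aggregate every value emitted for a single key."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Toy driver (an assumption for illustration, not a Hadoop API).

    A real cluster would run many mappers and reducers in parallel on
    different nodes; here everything runs sequentially in one process.
    """
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            # Group values by key, as the framework does between the passes.
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


word_counts = run_job(
    ["big data is big", "data is everywhere"],
    map_words,
    reduce_counts,
)
print(word_counts)
```

Even in this toy form, the developer has to spell out how each record is processed and how the results are combined. That imperative burden is what the Pig and Hive abstractions are meant to lift.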
What does Microsoft bring to the party? With HDInsight, developers can write MapReduce code in C# instead of Java, or use a LINQ provider to manipulate MapReduce indirectly through Hive. A NuGet package provides the C# MapReduce support, and a single-node developer version of HDInsight allows local debugging of such code in Visual Studio. A command-line utility provides deployment of the assembly to the local Hadoop instance. Deployment directly from Visual Studio to remote clusters, including the Windows Azure HDInsight implementation, seems a safe bet for future releases.
Bringing Hadoop to Windows (including to developers' own PCs) and providing integration and debugging support for C# and LINQ is a neat trick. It goes a long way toward making Hadoop an enterprise developer-friendly technology. Microsoft's alternate JavaScript-based framework for MapReduce code makes it friendly to Node.js and JavaScript developers, too. But will HDInsight appeal to the Linux- and Java-focused big data pros out there? Probably not, but therein lies the real value.
A Bigger Tent
Big data is a huge industry phenomenon right now, but the "data scientists" and MapReduce developers who enable its implementation are an exclusive bunch. These professionals are in short supply, and they don't come cheap. In other words, big data is a specialty at the height of its hype cycle, ripe for disruption.
We've seen this move before. Microsoft democratized Windows development with Visual Basic, enterprise development with .NET, relational database development with SQL Server and BI with a combination of that product plus SharePoint and Office. Every time Microsoft has disrupted an elite specialization, it's done so with developers in its ecosystem. Now it's trying again with big data and HDInsight.
Hadoop is different from past disrupted areas, though, because it's already developer-focused. But the developers who typify the Hadoop faithful right now work in lab environments -- whether in academic organizations, big Internet companies or startups. Even in the enterprise, big data practitioners work in lab-like organizations; they're not, by and large, typical developers from IT and business units.
But for big data to be big, that needs to change; the skill set needs to be ubiquitous and mainstream. Business developers are database developers. Microsoft thinks they can be big data developers too. And if they're also Windows client/Phone/Server/Azure developers, that would be "big" for Redmond, indeed.
About the Author
Andrew Brust is Research Director for Big Data and Analytics at Gigaom Research. Andrew is co-author of "Programming Microsoft SQL Server 2012" (Microsoft Press); an advisor to NYTECH, the New York Technology Council; co-moderator of Big On Data - New York's Data Intelligence Meetup; a Microsoft Regional Director and MVP; and conference co-chair of Visual Studio Live!