Redmond Review

Will Big Data Be Big with Developers?

.NET developers are database developers. Whether using ADO.NET, the Entity Framework or data binding, .NET devs work with transactional data as a matter of course. But data analytics work is another matter. In fact, very few application and enterprise developers do analytics. Can Microsoft change that?

Microsoft opened the big data world to its ecosystem about a year ago with the announcement of its "Project Isotope" Hadoop on Windows initiative. A year later, though still in preview form, the technology has a brand (HDInsight) and significant integration with .NET and Visual Studio, and is clearly strategic to Microsoft. And developers are in the crosshairs: HDInsight was featured at BUILD, the flagship Microsoft developer conference.

Why does Microsoft think developers will take to analytics with big data now, when they didn't do so with business intelligence (BI) before? And given the overwhelming orientation of the big data world to Linux and Java, how does Microsoft expect to succeed in the space with Windows and .NET? At first glance, this looks to be a fool's errand. Is Microsoft naïve and tone deaf, or is it on to something?

Unboxing HDInsight
Before we judge whether developers will flock to HDInsight or shun it, let's get a sense of what the product is and what developer tools it features. HDInsight is based on the open source Apache Hadoop project, which provides processing and analysis of huge data sets (up to petabyte scale) by distributing the storage and compute workloads across numerous servers in a cluster. While this may sound straightforward -- and similar in principle to products such as SQL Server Parallel Data Warehouse (PDW) -- Hadoop can be pretty hard to work with.

Hadoop is natively queried through imperative Java code, using a two-pass approach called MapReduce. In this framework, a Map function first preprocesses the data, and a Reduce function then aggregates it. Multiple Mappers run in parallel across various nodes in the cluster, passing their output to multiple Reducer nodes to finish the work, also in parallel. A component included in most Hadoop distributions (including HDInsight) called "Pig" provides a data transformation language abstraction layer over Java-based MapReduce code. "Hive," another such component, provides a SQL-like abstraction over it.
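To make the two-pass idea concrete, here is a minimal in-memory sketch of a word count, the customary first MapReduce example. It is not Hadoop code and uses no Hadoop API. The function names (map_words, shuffle, reduce_counts, run_job) and the sample input are assumptions made purely for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group mapper output by key, the hand-off between the two passes."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: aggregate all values emitted for one key."""
    return key, sum(values)


def run_job(lines):
    # On a real cluster, different mapper nodes would each take a share of the lines.
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    # Likewise, different reducer nodes would each take a share of the keys.
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not taken from any real data set.
    print(run_job(["big data big tent", "Big developers"]))
```

Because each line is mapped independently and each key is reduced independently, both passes can be spread across many machines, which is the property the framework exploits.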

What does Microsoft bring to the party? With HDInsight, developers can write MapReduce code in C# instead of Java, or use a LINQ provider to manipulate MapReduce indirectly through Hive. A NuGet package provides the C# MapReduce support, and a single-node developer version of HDInsight allows local debugging of such code in Visual Studio. A command-line utility provides deployment of the assembly to the local Hadoop instance. Deployment directly from Visual Studio to remote clusters, including the Windows Azure HDInsight implementation, seems a safe bet for future releases.

Bringing Hadoop to Windows (including to developers' own PCs) and providing integration and debugging support for C# and LINQ is a neat trick. It goes a long way toward making Hadoop an enterprise developer-friendly technology. Microsoft's alternate JavaScript-based framework for MapReduce code makes it friendly to Node.js and JavaScript developers, too. But will HDInsight appeal to the Linux- and Java-focused big data pros out there? Probably not, but therein lies the real value.

A Bigger Tent
Big data is a huge industry phenomenon right now, but the "data scientists" and MapReduce developers who enable its implementation are an exclusive bunch. These professionals are in short supply, and they don't come cheap. In other words, big data is a specialty at the height of its hype cycle, ripe for disruption.

We've seen this move before. Microsoft democratized Windows development with Visual Basic, enterprise development with .NET, relational database development with SQL Server and BI with a combination of that product plus SharePoint and Office. Every time Microsoft has disrupted an elite specialization, it's done so with developers in its ecosystem. Now it's trying again with big data and HDInsight.

Hadoop is different from past disrupted areas, though, because it's already developer-focused. But the developers who typify the Hadoop faithful right now work in lab environments -- whether in academic organizations, big Internet companies or startups. Even in the enterprise, big data practitioners work in lab-like organizations; they're not, by and large, typical developers from IT and business units.

But for big data to be big, that needs to change; the skill set needs to be ubiquitous and mainstream. Business developers are database developers. Microsoft thinks they can be big data developers too. And if they're also Windows client/Phone/Server/Azure developers, that would be "big" for Redmond, indeed.

About the Author

Andrew Brust is Research Director for Big Data and Analytics at Gigaom Research. Andrew is co-author of "Programming Microsoft SQL Server 2012" (Microsoft Press); an advisor to NYTECH, the New York Technology Council; co-moderator of Big On Data - New York's Data Intelligence Meetup; a Microsoft Regional Director and MVP; and conference co-chair of Visual Studio Live!
