Microsoft's HDInsight Big Data Service Almost Ready for Prime Time

Microsoft's forthcoming cloud-based big data service is "feature-complete" and in many regards is more advanced than rival offerings, but developers shouldn't anticipate a seamless experience. That was the assessment of Andrew Brust, CEO of Blue Badge Insights, in a session at Visual Studio Live! Chicago.

After an extended private test period that began last year, Microsoft in March released the beta of its Windows Azure HDInsight service, though testers have to pay a discounted fee, Brust told attendees at the conference Tuesday morning. HDInsight processes huge volumes of structured and unstructured data using Microsoft's SQL Server and the Hortonworks distribution of Hadoop, which stores data in the Hadoop Distributed File System (HDFS).

Brust said HDInsight performs well overall, but like all of the Hadoop-based offerings, it does not let developers seamlessly query these large file stores. "Hadoop is not yet ready for the enterprise, the tooling has a long way to go," Brust said. "Believe it or not the tooling that Microsoft gives you is superior to tools that other Hadoop vendors give you."

While data stores based on Apache Hadoop have rapidly become a popular platform for creating large data grids that can process terabytes and even petabytes of data, many implementations tend to run on Linux servers, Brust noted. That's because of the open source nature of Hadoop and the fact that MapReduce, the common parallel processing approach to reading and writing to HDFS, is Java-based.

Hortonworks has created a Hadoop distribution that can run on Windows Server and hence on Windows Azure. Microsoft, however, so far only has a preview of HDInsight running on Windows Azure. Brust said it's a safe bet that Microsoft will eventually release a Windows Server-based Hadoop offering of its own. In the meantime, developers can access the raw Hortonworks Data Platform (HDP) code.

Users can query HDInsight with their preferred BI tools, including Microsoft Excel, by using an ODBC driver for Hive, the Apache project that translates SQL queries into MapReduce code. "Technically any ODBC client can query Hadoop," Brust said. "We can take Excel, make Excel talk to Hive, [which] generates the MapReduce code, runs the job, brings the data up, and surfaces it up through the ODBC driver."
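To make the SQL-to-MapReduce idea concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce phases that a grouped count such as `SELECT word, COUNT(*) FROM words GROUP BY word` breaks down into conceptually. It is an illustration only: the sample rows and function names are hypothetical, and this is not the code Hive actually generates, nor does it run on a cluster.

```python
from collections import defaultdict

# Hypothetical input rows, standing in for a table with a single "word" column.
rows = ["hadoop", "hive", "hadoop", "azure", "hive", "hadoop"]


def map_phase(records):
    """Emit a (key, 1) pair for every input record."""
    for word in records:
        yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


word_counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(word_counts)
```

In a real cluster the map and reduce steps run in parallel across nodes and the shuffle moves data between them; the single-process version above only shows the shape of the computation.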

Brust said the HDInsight framework also lets developers write MapReduce code in JavaScript as well as in Microsoft's C#. Microsoft also offers an SDK for working with HDInsight that developers can download from CodePlex. "What that does is give you a bunch of base classes to make it even easier to write the code in C#," Brust said. Also available is a LINQ provider for Hive, he added.

With HDInsight, Microsoft has taken the uncharacteristic step of contributing its work back to the Apache project. "This is not Microsoft taking the Apache code and making a proprietary version, this is Microsoft making a Windows version and keeping the open source code branches," Brust said. "What's interesting is other companies have taken the open source code and have built proprietary distributions, which is historically a bit of a switch."

With the current beta, anyone can sign up from the Windows Azure portal. After signing up for the preview, a tester names a cluster, chooses the number of nodes, and specifies a Windows Azure storage account to associate with it, along with an administrative password. It takes about 20 minutes to create the cluster, according to Brust. He warned that in the current beta, testers can only use Windows Azure storage instances residing in the U.S. east data center.

The current discounted rate for the preview is whatever a customer would pay for Windows Azure compute and Windows Azure storage, but Microsoft divides the compute rate by two. So if you have a four-node cluster, you effectively are billed for just two nodes, Brust said.
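The compute side of that discount can be worked through with a short calculation. The sketch below assumes only what Brust described, that the compute charge is halved; the function names, the hours and the per-node rate are hypothetical placeholders rather than actual Windows Azure prices, and storage charges are left out.

```python
def billed_nodes(cluster_nodes: int) -> float:
    """Effective number of nodes billed when the compute rate is divided by two."""
    return cluster_nodes / 2


def preview_compute_cost(cluster_nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Compute charge under the preview discount (storage is billed separately)."""
    return billed_nodes(cluster_nodes) * hours * rate_per_node_hour


# A four-node cluster is effectively billed as two nodes.
print(billed_nodes(4))

# Hypothetical example: 10 hours at a placeholder rate of 1.0 per node-hour.
print(preview_compute_cost(4, 10, 1.0))
```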

While Microsoft hasn't yet announced when the HDInsight service will become generally available, Brust said he wouldn't be surprised to see it released this summer.

About the Author

Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.
