Microsoft's HDInsight Big Data Service Almost Ready for Prime Time

Microsoft's forthcoming cloud-based big data service is "feature-complete" and in many regards is more advanced than rival offerings, but developers shouldn't anticipate a seamless experience. That was the assessment of Andrew Brust, CEO of Blue Badge Insights, in a session at Visual Studio Live! Chicago.

After an extended private test period that began last year, Microsoft in March released the beta of its Windows Azure HDInsight service, though testers have to pay a discounted fee, Brust told attendees at the conference Tuesday morning. HDInsight processes huge volumes of structured and unstructured data using Microsoft's SQL Server and the Hortonworks distribution of Hadoop, including the Hadoop Distributed File System (HDFS).

Brust said HDInsight performs well overall, but like all of the Hadoop-based offerings, it does not let developers seamlessly create queries against these large file stores. "Hadoop is not yet ready for the enterprise, the tooling has a long way to go," Brust said. "Believe it or not the tooling that Microsoft gives you is superior to tools that other Hadoop vendors give you."

While data stores based on Apache Hadoop have rapidly become a popular platform for creating large data grids that can process terabytes and even petabytes of data, many implementations tend to run on Linux servers, Brust noted. That's because of the open source nature of Hadoop and the fact that MapReduce, the common parallel processing approach to reading and writing to HDFS, is Java-based.

Hortonworks has created a Hadoop distribution that can run on Windows Server and hence on Windows Azure. Microsoft, however, so far offers only a preview of HDInsight running on Windows Azure. Brust said it's a safe bet that Microsoft will release a Windows Server-based Hadoop offering of its own. In the meantime, developers can access the raw Hortonworks Data Platform (HDP) code.

Users can perform queries against HDInsight with their preferred BI tools, including Microsoft Excel, using an ODBC driver for Hive, the Apache project that translates SQL-like queries into MapReduce code. "Technically any ODBC client can query Hadoop," Brust said. "We can take Excel, make Excel talk to Hive, [which] generates the MapReduce code, runs the job, brings the data up, and surfaces it up through the ODBC driver."
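To make the Hive step concrete, the following is a minimal pure-Python sketch of how a SQL-style aggregate can be broken into map, shuffle and reduce phases. It does not use Hive, ODBC or Hadoop, and it is not the code Hive actually generates; the query, the sample rows and the function names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical query a BI tool might send to Hive:
#   SELECT region, SUM(amount) FROM sales GROUP BY region
# The rows below are made-up sample data, not from any real table.
SALES = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]


def map_phase(rows):
    """Emit one (key, value) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["region"], row["amount"]


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (here SUM) to each key's list of values."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    """Chain the three phases to answer the hypothetical GROUP BY query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(SALES))
```

On a real cluster the map and reduce steps run in parallel across nodes over data in HDFS; this single-process version only shows the shape of the computation.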

Brust said the HDInsight framework also lets developers write MapReduce code in JavaScript as well as in Microsoft's C#. Microsoft also offers an SDK for working with HDInsight that developers can download from CodePlex. "What that does is give you a bunch of base classes to make it even easier to write the code in C#," Brust said. Also available is a LINQ provider for Hive, he added.

Microsoft has taken the uncharacteristic step with HDInsight of contributing its work on it back to the Apache project. "This is not Microsoft taking the Apache code and making a proprietary version, this is Microsoft making a Windows version and keeping the open source code branches," Brust said. "What's interesting is other companies have taken the open source code and have built proprietary distributions, which is historically a bit of a switch."

With the current beta, anyone can sign up from the Windows Azure portal. After signing up for the preview, a tester names a cluster, chooses the number of nodes, specifies a Windows Azure storage account to associate with the cluster and sets an administrative password. It takes about 20 minutes to create the cluster, according to Brust. He warned that in the current beta, testers can use only Windows Azure storage instances residing in the U.S. East data center.

The current discounted rate for the preview is whatever a customer would pay for Windows Azure compute and Windows Azure storage, but Microsoft divides the compute rate by two. So if you have a four-node cluster, you effectively are billed for just two nodes, Brust said.
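As a rough illustration of that arithmetic, the sketch below halves the compute charge and leaves storage at its normal price, as Brust described. The function names and the rate used in the example are hypothetical; they are not actual Windows Azure prices.

```python
def preview_compute_charge(nodes, hours, rate_per_node_hour):
    """Compute charge during the preview: the normal compute cost divided by two."""
    if nodes < 0 or hours < 0 or rate_per_node_hour < 0:
        raise ValueError("inputs must be non-negative")
    full_price = nodes * hours * rate_per_node_hour
    return full_price / 2


def preview_total(nodes, hours, rate_per_node_hour, storage_charge):
    """Total preview bill: halved compute plus storage at its normal price."""
    return preview_compute_charge(nodes, hours, rate_per_node_hour) + storage_charge


if __name__ == "__main__":
    # Hypothetical rate of 0.5 per node-hour, purely for illustration.
    four_nodes_at_preview_rate = preview_compute_charge(4, 10, 0.5)
    two_nodes_at_full_rate = 2 * 10 * 0.5
    print(four_nodes_at_preview_rate == two_nodes_at_full_rate)
```

With any rate, a four-node cluster at the preview price costs the same in compute as two nodes at the full price, which matches the example in the session.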

While Microsoft hasn't yet announced when the HDInsight service will become generally available, Brust said he wouldn't be surprised to see it released this summer.

About the Author

Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.
