Microsoft's HDInsight Big Data Service Almost Ready for Prime Time
Microsoft's forthcoming cloud-based big data service is "feature-complete" and in many regards is more advanced than rival offerings, but developers shouldn't anticipate a seamless experience. That was the assessment of Andrew Brust, CEO of Blue Badge Insights, in a session at Visual Studio Live! Chicago.
After an extended private test period that began last year, Microsoft in March released the beta of its Windows Azure HDInsight service, though testers have to pay a discounted fee, Brust told attendees at the conference Tuesday morning. HDInsight processes huge volumes of structured and unstructured data using Microsoft's SQL Server and the Hortonworks distribution of Hadoop, which is built on the Hadoop Distributed File System (HDFS).
Brust said HDInsight performs well overall, but like all of the Hadoop-based offerings, it does not let developers seamlessly create queries against these large file stores. "Hadoop is not yet ready for the enterprise, the tooling has a long way to go," Brust said. "Believe it or not the tooling that Microsoft gives you is superior to tools that other Hadoop vendors give you."
While data stores based on Apache Hadoop have rapidly become a popular platform for creating large data grids that can process terabytes and even petabytes of data, most implementations run on Linux servers, Brust noted. That's because of the open source nature of Hadoop and the fact that MapReduce, the common parallel processing approach to reading from and writing to HDFS, is Java-based.
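For readers unfamiliar with the MapReduce model mentioned above, the following is a minimal in-memory sketch of its three stages (map, shuffle, reduce) applied to a word count. It is an illustration only: it does not use Hadoop's Java API or HDFS, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all values by key, as a framework would do
    # between the map and reduce stages.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: collapse the list of counts for one word into a total.
    return key, sum(values)


def word_count(lines):
    # Run the three stages in sequence over a list of text lines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on Azure", "big clusters"]))
```

In a real Hadoop cluster the map and reduce steps run in parallel across many nodes and read their input from HDFS; here everything runs in a single process to keep the idea visible.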
Hortonworks has created a Hadoop distribution that can run on Windows Server and hence on Windows Azure. Although that Windows Server-based distribution is available from Hortonworks, Microsoft so far has only a preview of HDInsight running on Windows Azure. Brust said it's a safe bet that Microsoft will offer a Windows Server-based Hadoop product of its own. In the meantime, developers can access the raw Hortonworks Data Platform (HDP) code.
Users can perform queries against HDInsight with their preferred BI tools, including Microsoft Excel, using an ODBC driver for Hive, the Apache project that translates SQL-like queries into MapReduce jobs. "Technically any ODBC client can query Hadoop," Brust said. "We can take Excel, make Excel talk to Hive, [which] generates the MapReduce code, runs the job, brings the data up, and surfaces it up through the ODBC driver."
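Because any ODBC client can issue the query, the same path Brust describes for Excel could in principle be driven from a script. The sketch below assumes the third-party pyodbc package and an already configured Hive ODBC data source; the DSN name and table name are hypothetical placeholders, not values from the HDInsight preview.

```python
def build_hive_query(table, limit):
    # Build a simple HiveQL statement. The table name is checked so that
    # only a plain identifier can be interpolated into the query text.
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {table} LIMIT {limit}"


def run_hive_query(dsn, sql):
    # Assumption: pyodbc is installed and `dsn` names a Hive ODBC data
    # source set up on this machine. Imported here so the query builder
    # above can be used without the driver present.
    import pyodbc

    # Hive has no transactions, so the connection is opened in autocommit mode.
    conn = pyodbc.connect(f"DSN={dsn}", autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)  # Hive turns this into a MapReduce job
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # "weblogs" and "HiveSample" are hypothetical names for illustration.
    query = build_hive_query("weblogs", 10)
    print(query)
    # rows = run_hive_query("HiveSample", query)
```

The call to `run_hive_query` is left commented out because it only works against a live cluster with a matching data source.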
Microsoft has taken the uncharacteristic step with HDInsight of contributing its work back to the Apache project. "This is not Microsoft taking the Apache code and making a proprietary version, this is Microsoft making a Windows version and keeping the open source code branches," Brust said. "What's interesting is other companies have taken the open source code and have built proprietary distributions, which is historically a bit of a switch."
With the current beta, anyone can sign up from the Windows Azure portal. After signing up for the preview, a tester names the cluster, chooses the number of nodes, selects a Windows Azure storage account to associate with it, and sets an administrative password. It takes about 20 minutes to create the cluster, according to Brust. He warned that in the current beta, testers can only use Windows Azure storage instances residing in the U.S. East data center.
The current discounted rate for the preview is whatever a customer would pay for Windows Azure compute and Windows Azure storage, but Microsoft divides the compute rate by two. So if you have a four-node cluster, you effectively are billed for just two nodes, Brust said.
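The preview pricing rule can be worked through as a small calculation. This is a sketch of the arithmetic as Brust described it (compute billed at half rate, storage at the normal rate); the hourly rate and storage cost passed in are made-up inputs, not actual Windows Azure prices.

```python
def billed_nodes(nodes):
    # Halving the compute rate is equivalent to paying for half the nodes.
    return nodes / 2


def preview_cost(nodes, hours, rate_per_node_hour, storage_cost=0.0):
    # Compute is charged at half the normal rate; storage is added in full.
    compute = billed_nodes(nodes) * hours * rate_per_node_hour
    return compute + storage_cost


if __name__ == "__main__":
    # Hypothetical figures: 4 nodes, 10 hours, 0.50 per node-hour, 3.00 storage.
    print(billed_nodes(4))
    print(preview_cost(4, 10, 0.50, storage_cost=3.00))
```

With four nodes, `billed_nodes` returns 2.0, matching the four-node example in the text.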
While Microsoft hasn't yet announced when the HDInsight service will become generally available, Brust said he wouldn't be surprised to see it released this summer.
About the Author
Jeffrey Schwartz is editor of Redmond magazine and also covers cloud computing for Virtualization Review's Cloud Report. In addition, he writes the Channeling the Cloud column for Redmond Channel Partner. Follow him on Twitter @JeffreySchwartz.