News

'.NET for Apache Spark' Debuts for C#/F# Big Data

Almost five years after the debut of Apache Spark, .NET developers are on track to more easily use the popular Big Data processing framework in C# and F# projects.

The preview project, called .NET for Apache Spark, was unveiled yesterday (April 24). Its development will be conducted in the open under the direction of the .NET Foundation.

Spark is described as a unified analytics engine for large-scale data processing, compatible with Apache Hadoop data whether batched or streamed.

Currently, Spark offers native APIs for Scala and Java, along with interop-layer extensions for Python and R. While .NET coders have been able to use Spark through Mobius, an earlier set of C# and F# language bindings and extensions, the new project seeks to improve on that scheme while paving the way for additional language support. Microsoft promised to work closely with the open source Spark community to help the project succeed where similar efforts such as Mobius, which it said were hindered by a lack of communication, fell short.

".NET for Apache Spark provides high performance APIs for using Spark from C# and F#," said Microsoft in an announcement post. "With [these] .NET APIs, you can access all aspects of Apache Spark including Spark SQL, DataFrames, Streaming, MLLib etc.," it said. ".NET for Apache Spark lets you reuse all the knowledge, skills, code, and libraries you already have as a .NET developer."

The project's origin is explained in a Spark Project Improvement Proposal (SPIP) titled .NET bindings for Apache Spark created on Feb. 27. It says: "Apache Spark provides programming language support for Scala/Java (native), and extensions for Python and R. While a variety of other language extensions are possible to include in Apache Spark, .NET would bring one of the largest developer community to the table. Presently, no good Big Data solution exists for .NET developers in open source. This SPIP aims at discussing how we can bring Apache Spark goodness to the .NET development platform."

Microsoft yesterday said: "The C#/F# language binding to Spark will be written on a new Spark interop layer which offers easier extensibility. This new layer of Spark interop was written keeping in mind best practices for language extension and optimizes for interop and performance. Long term this extensibility can be used for adding support for other languages in Spark."

Project backers will build on that extensibility, outlined in another SPIP, Interop Support for Spark Language Extensions, created last December, which says:

There is a desire for third party language extensions for Apache Spark. ... Presently, Apache Spark supports Python and R via a tightly integrated interop layer. It would seem that much of that existing interop layer could be refactored into a clean surface for general (third party) language bindings ...

Microsoft addressed the aforementioned lack of communication with the open source Spark community in its SPIP, stating:

We recognize that earlier attempts at this goal (specifically Mobius https://github.com/Microsoft/Mobius) were unsuccessful primarily due to the lack of communication with the Spark community. Therefore, another goal of this proposal is to not only develop .NET bindings for Spark in open source, but also continuously seek feedback from the Spark community via posted Jira's (like this one) and the Spark developer mailing list. Our hope is that through these engagements, we can build a community of developers that are eager to contribute to this effort or want to leverage the resulting .NET bindings for Spark in their respective Big Data applications.

Yesterday's announcement of the first preview also provided a peek at further development, which will include performance improvements such as Apache Arrow optimizations. Specifically, the project's roadmap calls for upcoming features such as:

  • Simplified getting started experience, documentation and samples
  • Native integration with developer tools such as Visual Studio, Visual Studio Code, Jupyter notebooks
  • .NET support for user-defined aggregate functions
  • .NET idiomatic APIs for C# and F# (e.g., using LINQ for writing queries)
  • Out-of-the-box support for Azure Databricks, Kubernetes, etc.
  • Make .NET for Apache Spark part of Spark Core

Source code for the preview project, along with detailed instructions for using it, can be found on GitHub, where it has already garnered 446 stars at the time of this writing (climbing by the minute), with Microsoft's Terry Kim and Rahul Potharaju listed as primary contributors.

About the Author

David Ramel is an editor and writer at Converge 360.
