News

.NET for Apache Spark Debuts in Version 1.0

The open source project .NET for Apache Spark has debuted in version 1.0, finally vaulting the C# and F# programming languages into Big Data first-class citizenship.

Spearheaded by Microsoft and the .NET Foundation, the package -- some two years in the making -- finally introduces .NET into the Big Data mix with a foothold provided by Apache Spark, one of the most popular open source projects of all time. After Big Data arrived upon the scene years ago with the promise of providing data-driven business insights to organizations, Apache Spark soon arrived to get all the pieces moving together with a unified analytics engine for Big Data processing via ready-made components for streaming, SQL, machine learning and graph processing.

With Microsoft extending C#/F# use cases from beyond the desktop to web, mobile, cloud and elsewhere, Big Data is just the latest outreach effort.

.NET for Apache Spark broke onto the scene last year, building upon the existing scheme that allowed for .NET to be used in Big Data projects via the precursor Mobius project and C# and F# language bindings and extensions used to leverage an interop layer with APIs for programming languages like Java, Python, Scala and R.

.NET for Apache Spark Architecture
[Click on image for larger view.] .NET for Apache Spark Architecture (source: Microsoft).

The v1.0 project now provides high-performance APIs for using Spark from C# and F# to access:

  • DataFrame and SparkSQL for working with structured data.
  • Spark Structured Streaming for working with streaming data.
  • Spark SQL for writing queries with SQL syntax.
  • Machine learning integration for faster training and prediction (that is, use .NET for Apache Spark alongside ML.NET).

"Version 1.0 includes support for .NET applications targeting .NET Standard 2.0 or later. Access to the Apache Spark DataFrame APIs (versions 2.3, 2.4 and 3.0) and the ability to write Spark SQL and create user-defined functions (UDFs) are also included in the release," an announcement post said.

Along with that expanded functionality, performance hasn't suffered, as Microsoft said programs that don't use user-defined functions (UDFs) are just as fast as Scala and PySpark-based non-UDF Spark applications, while applications including UDFs are at least as fast as PySpark programs and often faster.

A blog post provides the historical background for the project:

"About two years ago, we heard an increasing demand from the .NET community for an easier way to build big data applications with .NET instead of having to learn Scala or Python. Thus, we built a team based on members from the Azure Data engineering team (including members of the previous Mobius team) and members from the .NET team and started the .NET for Apache Spark open-source project. While we operate the project under the .NET Foundation, we did file Spark Project Improvement Proposals (SPIPs) SPARK-26257 and SPARK-27006 to have the work included in the Apache Spark project directly, if the community decides that is the right way to move forward. "We announced the first public version in April 2019 during the Databricks Spark+AI Summit 2019 and at Microsoft Build 2019, and since then have regularly released updates to the package. In fact we have shipped 12 pre-releases, included 318 pull requests by the time of version 1.0, addressed 286 issues filed on GitHub, included it in several of our own Spark offerings (Azure HDInsight and Azure Synapse Analytics) as well as integrated it with the .NET Interactive notebook experiences."

Microsoft said it conducted a survey about .NET developers and Big Data that found:

  • Most .NET developers want to use languages they are familiar with like C# and F# to build big data projects.
  • Support for features like Language Integrated Query (LINQ) is important.
  • The largest obstacles faced include setting up prerequisites and dependencies and finding quality documentation.
  • A top priority is supporting deployment options, including integration with CI/CD DevOps pipelines and publishing or submitting jobs directly from Visual Studio.

Regarding the latter item, the project's roadmap shows tooling improvements underway to make it easier to copy and paste Scala examples into Visual Studio, along with Visual Studio extensions for submitting an app to a remote Spark cluster and for .NET app debugging.

To address other concerns, Microsoft and project stewards have created community-contributed "ready-to-run" Docker images and sparked a recent wave of updates to the .NET for Spark documentation.

"We are continuously updating the project to keep pace with .NET developments, Spark releases, and feature requests such as GroupMap," the company said.

About the Author

David Ramel is an editor and writer at Converge 360.

comments powered by Disqus

Featured

  • Microsoft Revamps Fledgling AutoGen Framework for Agentic AI

    Only at v0.4, Microsoft's AutoGen framework for agentic AI -- the hottest new trend in AI development -- has already undergone a complete revamp, going to an asynchronous, event-driven architecture.

  • IDE Irony: Coding Errors Cause 'Critical' Vulnerability in Visual Studio

    In a larger-than-normal Patch Tuesday, Microsoft warned of a "critical" vulnerability in Visual Studio that should be fixed immediately if automatic patching isn't enabled, ironically caused by coding errors.

  • Building Blazor Applications

    A trio of Blazor experts will conduct a full-day workshop for devs to learn everything about the tech a a March developer conference in Las Vegas keynoted by Microsoft execs and featuring many Microsoft devs.

  • Gradient Boosting Regression Using C#

    Dr. James McCaffrey from Microsoft Research presents a complete end-to-end demonstration of the gradient boosting regression technique, where the goal is to predict a single numeric value. Compared to existing library implementations of gradient boosting regression, a from-scratch implementation allows much easier customization and integration with other .NET systems.

  • Microsoft Execs to Tackle AI and Cloud in Dev Conference Keynotes

    AI unsurprisingly is all over keynotes that Microsoft execs will helm to kick off the Visual Studio Live! developer conference in Las Vegas, March 10-14, which the company described as "a must-attend event."

Subscribe on YouTube