.NET for Apache Spark – Stream to SQL Server

In this article I am going to describe how to use .NET for Apache Spark with EntityFrameworkCore to stream data to Microsoft SQL Server. If you have tried this before, you have probably stumbled upon the following exception: Microsoft.Data.SqlClient is not supported on this platform.

So let’s find out how that can be fixed.

Preparation

If you want to stream to an SQL Server, you obviously need to have access to an SQL Server instance first.

Using Docker, it is very easy to fire up a suitable container. I’ve simply named it sqlserver, as shown below.

docker run --name sqlserver -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=L3t.M3.1n' -p 1433:1433 -d mcr.microsoft.com/mssql/server:2019-latest

After launching the container’s interactive terminal via

docker exec -it sqlserver /bin/bash

I use the command-line client sqlcmd to connect to the SQL Server instance, using the password that I specified in the docker command.

/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P L3t.M3.1n

First, let’s create a new database named testdb.

CREATE DATABASE testdb
GO

After switching the context to testdb, I also create a new table mytable.

USE testdb
GO

CREATE TABLE mytable (myid INT, mystring VARCHAR(50))
GO

Following best practices, I won’t use the sysadmin user to access the test database. Instead, I will create a separate user (testuser_rw) with permissions to read from and write to the database.

CREATE LOGIN [testuser_rw] WITH PASSWORD='TestM30ut', DEFAULT_DATABASE=[testdb]
GO
CREATE USER [testuser_rw] FOR LOGIN [testuser_rw] WITH DEFAULT_SCHEMA=[dbo]
GO
ALTER ROLE [db_datareader] ADD MEMBER [testuser_rw]
GO
ALTER ROLE [db_datawriter] ADD MEMBER [testuser_rw]
GO

Finally, let’s verify that the required roles are correctly assigned.

:setvar SQLCMDMAXFIXEDTYPEWIDTH 32
:setvar SQLCMDMAXVARTYPEWIDTH 32

SELECT p1.name [user] , p2.name [role] 
  FROM sys.database_role_members rm 
  JOIN sys.database_principals p1 ON rm.member_principal_id = p1.principal_id 
  JOIN sys.database_principals p2 ON rm.role_principal_id = p2.principal_id 
  WHERE p1.name in ('testuser_rw', 'dbo')

GO
[Screenshot: SQL Server – confirm role membership]

Excellent, our test database is now prepared.

HelloForeachWriter

[Screenshot: MyForeachWriter]

If you followed my post about using the ForeachWriter to stream data to PostgreSQL using EntityFrameworkCore, you might think that doing the same for SQL Server just requires some minor tweaks. Unfortunately, at least for now, it isn’t that straightforward.

I am not going to go through all the code again, but will just describe what it takes to get it working with MSSQL.

There are actually just two things that need to be changed in the example project:

  1. Add the SqlServer provider package
  2. Adjust the MyContext class to work with the new provider

Add the provider package

To add the SqlServer provider package for EntityFrameworkCore, just run the following command at the root folder of the example project.

dotnet add package Microsoft.EntityFrameworkCore.SqlServer --version 3.1.8

HelloForeachWriter.csproj should now contain a related PackageReference:

[Screenshot: Adding the EntityFrameworkCore SqlServer provider]
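For reference, the added entry should look roughly like this (the project’s other package references are not shown):

<ItemGroup>
  <PackageReference Include="Microsoft.EntityFrameworkCore.SqlServer" Version="3.1.8" />
</ItemGroup>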

Adjust MyContext.cs

There are two places in MyContext.cs that need to be adjusted. Obviously, we now need to change the connection options to use SqlServer and modify the connection string accordingly, as shown below.

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
            => optionsBuilder.UseSqlServer("Server=192.168.10.88;Database=testdb;User Id=testuser_rw;Password=TestM30ut");

Additionally, we need to adjust the type of the “mystring” column.

modelBuilder.Entity<MyTable>()
            .Property(p => p.MyString)
            .HasColumnName("mystring")
            .HasColumnType("varchar(50)");

Here is a screenshot of the updated MyContext class.

[Screenshot: Adjusting MyContext]
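For reference, here is a minimal sketch of what the adjusted MyContext class could look like. The entity class MyTable with its MyId and MyString properties, as well as the key and table mapping, are assumptions based on the mytable definition above and the earlier PostgreSQL post; your actual class may differ slightly.

using Microsoft.EntityFrameworkCore;

public class MyTable
{
    public int MyId { get; set; }
    public string MyString { get; set; }
}

public class MyContext : DbContext
{
    public DbSet<MyTable> MyTables { get; set; }

    // Use the SqlServer provider instead of the PostgreSQL one.
    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlServer(
            "Server=192.168.10.88;Database=testdb;User Id=testuser_rw;Password=TestM30ut");

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<MyTable>().ToTable("mytable");

        // EF Core needs a key for inserts; myid is not an IDENTITY column,
        // so the values are supplied by the application.
        modelBuilder.Entity<MyTable>()
            .HasKey(p => p.MyId);

        modelBuilder.Entity<MyTable>()
            .Property(p => p.MyId)
            .HasColumnName("myid")
            .ValueGeneratedNever();

        // varchar(50) matches the column type used when creating the table.
        modelBuilder.Entity<MyTable>()
            .Property(p => p.MyString)
            .HasColumnName("mystring")
            .HasColumnType("varchar(50)");
    }
}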

The Problem

It’s now time to run HelloForeachWriter in debug mode for the first time.

To do so, I first build the project and then change into the related debug directory in order to submit it to Spark via:

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.12.1.jar debug

After that, I can start the application in debug mode.
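In my case, that just means launching the driver application, either via the IDE’s debugger or directly from the shell, e.g. (assuming you are in the project’s root folder):

dotnet run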

Stream to SQL Server – First Try

[Screenshot: Stream to SQL Server – exception]

Unfortunately, that did not work as expected, and I got the following exception instead:

System.PlatformNotSupportedException: Microsoft.Data.SqlClient is not supported on this platform.

This may be confusing, as there actually is a Microsoft.Data.SqlClient DLL in the directory, along with a related runtime sub-directory.

[Screenshot: Microsoft.Data.SqlClient]

So the problem seems to be that there are different versions of the Microsoft.Data.SqlClient DLL, depending on the runtime platform, but Microsoft.Spark.Worker cannot figure out which one to use.

For in-depth information feel free to dig into the Microsoft SqlClient Data Provider for SQL Server repository on GitHub.

A Solution

In theory, there’s actually quite a simple solution for this: I just need to add Microsoft.Data.SqlClient as a dependency to Microsoft.Spark.Worker as well.

I could do this by manually adding the dependencies to Microsoft.Spark.Worker.deps.json and copying the related DLL(s) into the Microsoft.Spark.Worker folder. However, it feels more consistent to me to just rebuild the worker with the Microsoft.Data.SqlClient package added.

Rebuilding Microsoft.Spark.Worker

So let’s get our feet wet and clone the dotnet-spark GitHub repository using

git clone https://github.com/dotnet/spark.git

Next, change into the spark repository directory, navigate to src\csharp\Microsoft.Spark.Worker and add the package.

dotnet add package Microsoft.Data.SqlClient

Depending on your target platform, publish Microsoft.Spark.Worker. For 64-bit Windows, for example, use:

dotnet publish -f netcoreapp3.1 -r win-x64

And for 64-bit Linux:

dotnet publish -f netcoreapp3.1 -r linux-x64

A more detailed description of the build process for .NET for Apache Spark is available in the build documentation of the dotnet/spark repository.

Once published, the resulting files are available within the artifacts sub-folder of the repository. For win-x64 for example, this would be:

artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish

And as you can see from the screenshot below, the Microsoft.Data.SqlClient DLL(s) are now included as well.

[Screenshot: DLL(s) for Windows]

For Linux, however, this looks slightly different.

[Screenshot: DLL for Linux]

There’s no Microsoft.Data.SqlClient.SNI.dll in this case.

As a last step, I need to ensure that I either point the DOTNET_WORKER_DIR environment variable to this new directory, or just copy all the files into the directory that is currently referenced by DOTNET_WORKER_DIR.
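On Linux, for example, pointing the environment variable at the newly published worker could look like this (the clone location of the repository is of course just a placeholder):

export DOTNET_WORKER_DIR=~/spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish

On Windows, setx DOTNET_WORKER_DIR "C:\path\to\publish" achieves the same persistently.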

Stream to SQL Server – Next Try

Now it’s time to run the whole process again.

[Screenshot: Stream to SQL Server – working]

Marvelous! This time we didn’t get any exception.

And the data arrived in the database!

Conclusion

As demonstrated, it isn’t that hard to get .NET for Apache Spark working with EntityFrameworkCore to stream to SQL Server, thanks to .NET for Apache Spark being Open Source.

I am not sure whether the Microsoft.Data.SqlClient package will be added by default, or if there will be any better ways to address this in the future. That’s why I’ve decided, at least for now, to provide a special version of my dotnet-spark Docker image with the package pre-installed: 3rdman/dotnet-spark:0.12.1-spark-2.4.6-linux-sqlclient

Thanks for reading!
