In this article I am going to describe how to use .NET for Apache Spark with EntityFrameworkCore to stream data to a Microsoft SQL Server. If you have tried this before, you probably stumbled upon the following exception: Microsoft.Data.SqlClient is not supported on this platform.
So let’s find out how that can be fixed.
Preparation
If you want to stream to an SQL Server, you obviously need to have access to an SQL Server instance first.
Using Docker, it is very easy to fire up a suitable container. I’ve simply named it sqlserver, as shown below.
docker run --name sqlserver -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=L3t.M3.1n' -p 1433:1433 -d mcr.microsoft.com/mssql/server:2019-latest
After launching the container’s interactive terminal via
docker exec -it sqlserver /bin/bash
I use the command-line client sqlcmd to connect to the SQL Server instance, using the password that I specified in the docker command.
/opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P L3t.M3.1n
First, let’s create a new database named testdb.
CREATE DATABASE testdb
GO
After switching the context to testdb, I also create a new table mytable.
USE testdb
GO
CREATE TABLE mytable (myid INT, mystring VARCHAR(50))
GO
Following best practices, I won’t use the sysadmin user to access the test database. Instead, I will create a separate user (testuser_rw) with permissions to read from and write to the database.
CREATE LOGIN [testuser_rw] WITH PASSWORD='TestM30ut', DEFAULT_DATABASE=[testdb]
GO
CREATE USER [testuser_rw] FOR LOGIN [testuser_rw] WITH DEFAULT_SCHEMA=[dbo]
GO
ALTER ROLE [db_datareader] ADD MEMBER [testuser_rw]
GO
ALTER ROLE [db_datawriter] ADD MEMBER [testuser_rw]
GO
Finally, let’s verify that the required roles are correctly assigned.
:setvar SQLCMDMAXFIXEDTYPEWIDTH 32
:setvar SQLCMDMAXVARTYPEWIDTH 32
SELECT p1.name [user]
     , p2.name [role]
FROM sys.database_role_members rm
JOIN sys.database_principals p1 ON rm.member_principal_id = p1.principal_id
JOIN sys.database_principals p2 ON rm.role_principal_id = p2.principal_id
WHERE p1.name IN ('testuser_rw', 'dbo')
GO
Excellent, our test database is now prepared.
HelloForeachWriter
If you followed my post about using the ForeachWriter to stream data to PostgreSQL using EntityFrameworkCore, you might think that doing the same for SQL Server just requires some minor tweaks. Unfortunately, at least for now, it isn’t that straightforward.
I am not going to walk through all the code again, but will just describe what it takes to get it working with MSSQL.
There are actually just two things that need to be changed in the example project:
- Add the SqlServer provider package
- Adjust the MyContext class to work with the new provider
Add the provider package
To add the SqlServer provider package for EntityFrameworkCore, just run the following command in the root folder of the example project.
dotnet add package Microsoft.EntityFrameworkCore.SqlServer --version 3.1.8
HelloForeachWriter.csproj should now contain a related PackageReference:
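The exact layout of the project file may differ, but with the version from the command above, the entry should look roughly like this:
<ItemGroup>
  <PackageReference Include="Microsoft.EntityFrameworkCore.SqlServer" Version="3.1.8" />
</ItemGroup>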
Adjust MyContext.cs
There are two places in MyContext.cs that need to be adjusted. Obviously we now need to change the connection options to use SqlServer and also modify the connection string accordingly, as shown below.
protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder) =>
    optionsBuilder.UseSqlServer("Server=192.168.10.88;Database=testdb;User Id=testuser_rw;Password=TestM30ut");
Additionally we need to adjust the type of the “mystring” column.
modelBuilder.Entity<MyTable>()
    .Property(p => p.MyString)
    .HasColumnName("mystring")
    .HasColumnType("varchar(50)");
Here is a screenshot of the updated MyContext class.
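The full class is explained in the PostgreSQL post, but as a minimal sketch it could look like the following. This assumes a MyTable entity with a MyId property that maps to the myid column and is configured as a non-generated key (EF Core needs a key to track inserts, even though the table itself has no primary key constraint):

using Microsoft.EntityFrameworkCore;

namespace HelloForeachWriter
{
    // Entity mapped to the mytable table created earlier.
    public class MyTable
    {
        public int MyId { get; set; }
        public string MyString { get; set; }
    }

    public class MyContext : DbContext
    {
        public DbSet<MyTable> MyTables { get; set; }

        protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder) =>
            optionsBuilder.UseSqlServer("Server=192.168.10.88;Database=testdb;User Id=testuser_rw;Password=TestM30ut");

        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            modelBuilder.Entity<MyTable>()
                .ToTable("mytable");

            // EF Core needs a key for change tracking; the value comes from the
            // stream, so it is not generated by the database.
            modelBuilder.Entity<MyTable>()
                .HasKey(p => p.MyId);

            modelBuilder.Entity<MyTable>()
                .Property(p => p.MyId)
                .HasColumnName("myid")
                .ValueGeneratedNever();

            modelBuilder.Entity<MyTable>()
                .Property(p => p.MyString)
                .HasColumnName("mystring")
                .HasColumnType("varchar(50)");
        }
    }
}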
The Problem
It’s now time to run HelloForeachWriter in debug mode for the first time.
To do that, I first build the project and then change into the related debug directory in order to submit it to Spark via:
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.12.1.jar debug
After that, I can start the application in debug mode.
Stream to SQL Server – First Try
Unfortunately, that did not work as expected, and I got the following exception instead:
System.PlatformNotSupportedException: Microsoft.Data.SqlClient is not supported on this platform.
This may be confusing, as there actually is a Microsoft.Data.SqlClient DLL in the directory, along with a related runtime sub-directory.
So the problem seems to be that there are different versions of the Microsoft.Data.SqlClient DLL, depending on the runtime platform, and Microsoft.Spark.Worker cannot figure out which one to use.
For in-depth information feel free to dig into the Microsoft SqlClient Data Provider for SQL Server repository on GitHub.
A Solution
In theory, there’s actually quite a simple solution for this: I just need to add Microsoft.Data.SqlClient as a dependency to Microsoft.Spark.Worker as well.
I could do this by manually adding the dependencies to Microsoft.Spark.Worker.deps.json and copying the related DLL(s) into the Microsoft.Spark.Worker folder. However, it feels more consistent to me to just rebuild the worker with the Microsoft.Data.SqlClient package added.
Rebuilding Microsoft.Spark.Worker
So let’s get our feet wet and clone the dotnet-spark GitHub repository using
git clone https://github.com/dotnet/spark.git
Next, change into the spark repository directory, navigate to src\csharp\Microsoft.Spark.Worker and add the package.
dotnet add package Microsoft.Data.SqlClient
Depending on your target platform, publish Microsoft.Spark.Worker. For 64-bit Windows, for example, use:
dotnet publish -f netcoreapp3.1 -r win-x64
And for 64-bit Linux:
dotnet publish -f netcoreapp3.1 -r linux-x64
A more detailed description of the build process for .NET for Apache Spark is available here.
Once published, the resulting files are available within the artifacts sub-folder of the repository. For win-x64, for example, this would be:
artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp3.1\win-x64\publish
And as you can see from the screenshot below, the Microsoft.Data.SqlClient DLL(s) are now included as well.
For Linux, however, the output looks slightly different: there’s no Microsoft.Data.SqlClient.SNI.dll in this case.
As a last step, I need to either point the DOTNET_WORKER_DIR environment variable to this new directory or copy all the files into the directory that is currently referenced by DOTNET_WORKER_DIR.
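On Linux, for example, that could look something like this (assuming the repository was cloned to ~/spark; adjust the path to wherever your publish output ended up):
export DOTNET_WORKER_DIR=~/spark/artifacts/bin/Microsoft.Spark.Worker/Debug/netcoreapp3.1/linux-x64/publish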
Stream to SQL Server – Next try
Now it’s time to run the whole process again.
Marvelous! This time we didn’t get any exception.
And the data arrived at the database!
Conclusion
As demonstrated, it isn’t that hard to get .NET for Apache Spark working with EntityFrameworkCore to stream to SQL Server, thanks to .NET for Apache Spark being Open Source.
I am not sure whether the Microsoft.Data.SqlClient package will be added by default in the future, or if there will be better ways to address this. That’s why I’ve decided, at least for now, to provide a special version of my dotnet-spark Docker image with the package pre-installed: 3rdman/dotnet-spark:0.12.1-spark-2.4.6-linux-sqlclient
Thanks for reading!