Loading Data from Microsoft SQL Server Database with PySpark

Loading Data from Microsoft SQL Server Database with PySpark

In this article we are going to learn how to load Dataset from Microsoft SQL Server with PySpark

We start by downloading the pyodbc library with pip

pip install pyodbc

We also need to set up our Microsoft SQL Server Local host server. This guide is important if you are using the Windows authentication in SQL Server

Step 1: Go to the object explorer tab

Right-click on your Windows authentication login profile, (YOUR_USERNAME\SQLEXPRESS) then click on properties.

Then click on security and choose SQL Server and Windows Authentication mode. Click Ok

Step 2: Go to the object explorer tab

Right-click on the Security folder (BELOW THE DATABASE FOLDER), Click on New then Click on Login.

Create a new Login name, Choose SQL Server Authentication. Create a new password and Confirm the password. Click ok.

Step 3: Open the SQL Server Configuration Manager

protocols for SQL Server Enable the TCP/IP (Click on the TCP/IP and choose Yes to enable)

The TCP/IP has 2 tabs when opened, the protocol and IP address tab would be displayed. After enabling the TCP/IP, wait a few minutes, then on the IP Address tab scroll down to locate the IPAII The TCP Dynamic port is the port your SQL Server Runs on Copy the value it displays.

Close the server configuration manager.

Step 4: Open the SSMS Studio

While trying to log in the server name should look like this

localhost, YOUR TCP/IP SERVER PORT VALUE

example with a random port number

localhost,1433

login name use the name you registered you will be asked to create a new password and confirm it

(Kindly save your Windows and sql server authentication name somewhere save so you can log in with either whenever you want.)

Importing our database with Pyspark

# Load the pyspark libraries

from pyspark.sql import SparkSession
import pyodbc
import pandas as pd


spark = SparkSession.builder \
    .appName("SQL Server to PySpark") \
    .getOrCreate()
# Set up the database connection

db_host = "localhost"
db_port = "YOUR TCP/IP PORT NUMBER"
db_name = "YOUR_DATABASE NAME"
db_user = "YOUR_USERNAME"
db_password = "YOUR_PASSWORD"
# Connect SQL Server with Pyspark

conn = pyodbc.connect(
    "Driver={SQL Server};"
    "Server=" + db_host + "," + db_port + ";"
    "Database=" + db_name + ";"
    "UID=" + db_user + ";"
    "PWD=" + db_password + ";"
)

query = "SELECT * FROM TABLE_NAME"
df_pandas = pd.read_sql(query, conn)
# Convert Data into a Data Frame

df_spark = spark.createDataFrame(df_pandas)
# Read the dataframe

# Print the schema of the DataFrame
df_spark.printSchema()

# Show the first few rows
df_spark.show()

Output from a sample data I used:

The next Article will be Loading Data From MySQL Database With PySpark