Loading Data from MongoDB Database with PySpark

The last database we will connect to with PySpark is MongoDB.

MongoDB is a NoSQL document database that stores records as JSON-like documents.

We start by installing PyMongo, the MongoDB driver for Python:

pip install pymongo
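
After the install, it can help to confirm that the driver is importable from the same Python environment that will run PySpark. The sketch below uses only the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


# Report on the two packages this tutorial relies on.
for name in ("pymongo", "pyspark"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either package is reported as missing, install it with pip in that environment before continuing.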

To set up MongoDB, download the MongoDB Community Server and MongoDB Compass from the MongoDB website, or try the hosted MongoDB Cloud instead. Follow an installation guide for your operating system to set up the server.

The MongoDB setup is a bit technical, so after installing the MongoDB shell (mongosh) and Compass, the next step is to insert data into the admin database. The sample data looks like this:

[
  {
    "name": "John Doe",
    "age": 30,
    "email": "john.doe@example.com",
    "address": "123 Main Street"
  },
  {
    "name": "Jane Smith",
    "age": 28,
    "email": "jane.smith@example.com",
    "address": "456 Elm Avenue"
  },
  {
    "name": "Mike Johnson",
    "age": 35,
    "email": "mike.johnson@example.com",
    "address": "789 Oak Road"
  },
  {
    "name": "Alice Brown",
    "age": 25,
    "email": "alice.brown@example.com",
    "address": "101 Maple Lane"
  },
  {
    "name": "Bob Wilson",
    "age": 32,
    "email": "bob.wilson@example.com",
    "address": "222 Cedar Street"
  },
  {
    "name": "Emily Lee",
    "age": 29,
    "email": "emily.lee@example.com",
    "address": "333 Pine Boulevard"
  },
  {
    "name": "Chris Davis",
    "age": 31,
    "email": "chris.davis@example.com",
    "address": "444 Spruce Drive"
  },
  {
    "name": "Sarah Miller",
    "age": 27,
    "email": "sarah.miller@example.com",
    "address": "555 Birch Court"
  },
  {
    "name": "David Taylor",
    "age": 33,
    "email": "david.taylor@example.com",
    "address": "666 Willow Avenue"
  },
  {
    "name": "Olivia Martin",
    "age": 26,
    "email": "olivia.martin@example.com",
    "address": "777 Hickory Lane"
  },
  {
    "name": "James Clark",
    "age": 34,
    "email": "james.clark@example.com",
    "address": "888 Aspen Road"
  },
  {
    "name": "Sophia Hill",
    "age": 28,
    "email": "sophia.hill@example.com",
    "address": "999 Oakwood Court"
  },
  {
    "name": "Michael Turner",
    "age": 30,
    "email": "michael.turner@example.com",
    "address": "111 Willow Circle"
  },
  {
    "name": "Lily Mitchell",
    "age": 29,
    "email": "lily.mitchell@example.com",
    "address": "222 Elm Terrace"
  },
  {
    "name": "Ethan Adams",
    "age": 32,
    "email": "ethan.adams@example.com",
    "address": "333 Maple Road"
  },
  {
    "name": "Ava Hall",
    "age": 27,
    "email": "ava.hall@example.com",
    "address": "444 Oak Avenue"
  },
  {
    "name": "Matthew Cox",
    "age": 31,
    "email": "matthew.cox@example.com",
    "address": "555 Cedar Court"
  },
  {
    "name": "Isabella Rivera",
    "age": 26,
    "email": "isabella.rivera@example.com",
    "address": "666 Pine Lane"
  },
  {
    "name": "Daniel Ward",
    "age": 33,
    "email": "daniel.ward@example.com",
    "address": "777 Birch Boulevard"
  },
  {
    "name": "Mia Torres",
    "age": 28,
    "email": "mia.torres@example.com",
    "address": "888 Spruce Drive"
  }
]

The data above is what I uploaded into the admin database in MongoDB. To do that, I went to Local Disk (C:), opened Program Files, and located the MongoDB installation folder.

I right-clicked the folder and chose to open it in the command line or terminal. There, I started the MongoDB server by running:

mongod
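
Once `mongod` is running, a quick check that it is accepting connections can save some debugging later. This is a minimal sketch using only the standard library; it assumes the server listens on MongoDB's default port 27017 on the local machine, and `is_port_open` is a hypothetical helper name.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 27017 is MongoDB's default port, the same one used in the connection URL later on.
if is_port_open("localhost", 27017):
    print("mongod is accepting connections on port 27017")
else:
    print("Nothing is listening on port 27017; check that mongod started")
```

This only shows that something is listening on the port, not that it is MongoDB, so treat it as a first sanity check.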

I then opened another command-line tab and started the MongoDB shell:

mongosh

In the mongosh prompt, I switched to the admin database:

use admin

I then ran the following command to populate the admin database with data (sample data generated by ChatGPT):

db.your_collection.insertMany([
  {
    "name": "John Doe",
    "age": 30,
    "email": "john.doe@example.com",
    "address": "123 Main Street"
  },
  {
    "name": "Jane Smith",
    "age": 28,
    "email": "jane.smith@example.com",
    "address": "456 Elm Avenue"
  },
  {
    "name": "Mike Johnson",
    "age": 35,
    "email": "mike.johnson@example.com",
    "address": "789 Oak Road"
  },
  {
    "name": "Alice Brown",
    "age": 25,
    "email": "alice.brown@example.com",
    "address": "101 Maple Lane"
  },
  {
    "name": "Bob Wilson",
    "age": 32,
    "email": "bob.wilson@example.com",
    "address": "222 Cedar Street"
  },
  {
    "name": "Emily Lee",
    "age": 29,
    "email": "emily.lee@example.com",
    "address": "333 Pine Boulevard"
  },
  {
    "name": "Chris Davis",
    "age": 31,
    "email": "chris.davis@example.com",
    "address": "444 Spruce Drive"
  },
  {
    "name": "Sarah Miller",
    "age": 27,
    "email": "sarah.miller@example.com",
    "address": "555 Birch Court"
  },
  {
    "name": "David Taylor",
    "age": 33,
    "email": "david.taylor@example.com",
    "address": "666 Willow Avenue"
  },
  {
    "name": "Olivia Martin",
    "age": 26,
    "email": "olivia.martin@example.com",
    "address": "777 Hickory Lane"
  },
  {
    "name": "James Clark",
    "age": 34,
    "email": "james.clark@example.com",
    "address": "888 Aspen Road"
  },
  {
    "name": "Sophia Hill",
    "age": 28,
    "email": "sophia.hill@example.com",
    "address": "999 Oakwood Court"
  },
  {
    "name": "Michael Turner",
    "age": 30,
    "email": "michael.turner@example.com",
    "address": "111 Willow Circle"
  },
  {
    "name": "Lily Mitchell",
    "age": 29,
    "email": "lily.mitchell@example.com",
    "address": "222 Elm Terrace"
  },
  {
    "name": "Ethan Adams",
    "age": 32,
    "email": "ethan.adams@example.com",
    "address": "333 Maple Road"
  },
  {
    "name": "Ava Hall",
    "age": 27,
    "email": "ava.hall@example.com",
    "address": "444 Oak Avenue"
  },
  {
    "name": "Matthew Cox",
    "age": 31,
    "email": "matthew.cox@example.com",
    "address": "555 Cedar Court"
  },
  {
    "name": "Isabella Rivera",
    "age": 26,
    "email": "isabella.rivera@example.com",
    "address": "666 Pine Lane"
  },
  {
    "name": "Daniel Ward",
    "age": 33,
    "email": "daniel.ward@example.com",
    "address": "777 Birch Boulevard"
  },
  {
    "name": "Mia Torres",
    "age": 28,
    "email": "mia.torres@example.com",
    "address": "888 Spruce Drive"
  }
])
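
The same load can also be done from Python instead of mongosh. The following is a minimal sketch, assuming the sample documents are available as JSON text; `load_documents` and `insert_documents` are hypothetical helper names, while `insert_many` is PyMongo's bulk insert method on a collection. Only one sample document is shown to keep the example short.

```python
import json

REQUIRED_FIELDS = {"name", "age", "email", "address"}


def load_documents(json_text):
    """Parse a JSON array of documents and check each has the expected fields."""
    documents = json.loads(json_text)
    for position, doc in enumerate(documents):
        missing = REQUIRED_FIELDS - set(doc)
        if missing:
            raise ValueError(f"document {position} is missing {sorted(missing)}")
    return documents


def insert_documents(collection, documents):
    """Insert the documents with insert_many and return how many were written."""
    result = collection.insert_many(documents)
    return len(result.inserted_ids)


sample = (
    '[{"name": "John Doe", "age": 30, '
    '"email": "john.doe@example.com", "address": "123 Main Street"}]'
)
documents = load_documents(sample)
print(len(documents), "document(s) ready to insert")

# With a running server, the call would look like this (not executed here):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017/admin")
# insert_documents(client["admin"]["your_collection"], documents)
```

Validating the fields first is optional, but it catches a malformed record before anything is written to the collection.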

In your Python IDE, run the following code:

from pyspark.sql import SparkSession
from pymongo import MongoClient

# Connect to the local MongoDB server; replace the URL with your own connection string if it differs
client = MongoClient('mongodb://localhost:27017/admin')

# Replace 'admin' and 'your_collection' with your actual database and collection names
db = client['admin']
collection = db['your_collection']

# Fetch the data as a list of dictionaries
data = list(collection.find())

# Create a SparkSession
spark = SparkSession.builder \
    .appName("MongoDB with PySpark") \
    .getOrCreate()

# Convert each ObjectId to a string so that Spark can infer a column type for _id
for item in data:
    item['_id'] = str(item['_id'])

# Create the DataFrame
df = spark.createDataFrame(data)

# Show the DataFrame
df.show()
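
The `_id` conversion in the script matters because every MongoDB document carries an `ObjectId`, which Spark cannot map to one of its column types during schema inference. Since documents in a collection are not required to share the same fields, it can also help to give every record the same keys before building the DataFrame. The sketch below is one way to do that in plain Python; `normalize_documents` and the `FIELDS` tuple are assumptions made for illustration and are not part of PyMongo or PySpark.

```python
FIELDS = ("_id", "name", "age", "email", "address")


def normalize_documents(documents, fields=FIELDS):
    """Return new dicts with a string _id and the same keys in every record."""
    normalized = []
    for doc in documents:
        # Missing fields become None so every record has the same shape.
        record = {field: doc.get(field) for field in fields}
        if record.get("_id") is not None:
            record["_id"] = str(record["_id"])  # ObjectId -> plain string
        normalized.append(record)
    return normalized


# Two stand-in documents; the second lacks an address, as MongoDB allows.
raw = [
    {"_id": 1, "name": "John Doe", "age": 30,
     "email": "john.doe@example.com", "address": "123 Main Street"},
    {"_id": 2, "name": "Jane Smith", "age": 28,
     "email": "jane.smith@example.com"},
]
rows = normalize_documents(raw)
print(rows[1])
```

The resulting list could then be passed to `spark.createDataFrame(rows)` in place of `data`, with the missing address showing up as a null value.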

The Output:

In the next article, we will learn how to load a CSV file with PySpark.

Happy Learning!!!