Mohammad Gufran Jahangir | August 9, 2025

Semi-structured data is everywhere — from web APIs and IoT devices to log files and application events. One of the most common semi-structured formats is JSON (JavaScript Object Notation). It’s human-readable, flexible, and easy to work with — but it can also be tricky to ingest and query efficiently at scale.

In this blog, we’ll explore how to ingest JSON data into Databricks, why it’s important, and some best practices.


What is Semi-Structured Data?

Unlike structured data (rows and columns in a relational database) or completely unstructured data (images, audio, video), semi-structured data contains elements of both:

  • It’s organized in a predictable format (e.g., key-value pairs in JSON).
  • It’s flexible, allowing different records to have varying fields.

Example JSON:

{
  "user": "john_doe",
  "action": "login",
  "device": {
    "type": "mobile",
    "os": "iOS"
  },
  "timestamp": "2025-08-09T10:15:30Z"
}
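Because the structure is predictable, standard tooling can parse it directly. In plain Python, for instance, the example above maps naturally onto nested dictionaries:

```python
import json

record = json.loads("""
{
  "user": "john_doe",
  "action": "login",
  "device": {"type": "mobile", "os": "iOS"},
  "timestamp": "2025-08-09T10:15:30Z"
}
""")

# Top-level keys behave like columns; nested objects nest naturally.
print(record["user"])          # john_doe
print(record["device"]["os"])  # iOS
```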

Why JSON is Common in Data Engineering

  • APIs & Services: Most modern APIs return JSON.
  • Logs & Events: Cloud services, microservices, and IoT devices often log in JSON format.
  • Flexibility: Schema can evolve without breaking the entire dataset.

Ingesting JSON in Databricks

Databricks provides multiple ways to ingest JSON files stored in cloud storage (Azure Data Lake, S3, GCS, etc.).

1. Using Auto Loader for Incremental Ingestion

If JSON data is continuously landing in cloud storage, Databricks Auto Loader can incrementally load new files:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # A schema location is required so Auto Loader can track the inferred
    # schema and evolve it as new fields arrive (path is illustrative):
    .option("cloudFiles.schemaLocation", "/mnt/schemas/events/")
    .load("/mnt/data/json/"))

  • Pros: Scalable, tracks processed files for you, handles schema evolution automatically.
  • Best For: Real-time or near-real-time ingestion.
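Conceptually, the incremental pattern Auto Loader automates looks roughly like this: remember which files were already processed and pick up only the new ones. A simplified plain-Python sketch (checkpointing, retries, and schema tracking omitted; file names are illustrative):

```python
import json
import tempfile
from pathlib import Path

def load_new_files(landing_dir, processed):
    """Read only JSON-lines files not seen in earlier runs, then mark them seen."""
    records = []
    for path in sorted(Path(landing_dir).glob("*.json")):
        if path.name in processed:
            continue  # already ingested in an earlier run
        for line in path.read_text().splitlines():
            if line.strip():
                records.append(json.loads(line))
        processed.add(path.name)
    return records

# Demo: the second run picks up only the newly landed file.
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.json").write_text('{"user": "john_doe"}\n')
    seen = set()
    first = load_new_files(d, seen)
    Path(d, "b.json").write_text('{"user": "jane_doe"}\n')
    second = load_new_files(d, seen)
```

Auto Loader does all of this (and more) with exactly-once guarantees, which is why it is preferred over hand-rolled file tracking.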

2. Using spark.read.json for Batch Loads

For one-time or periodic batch ingestion:

df = spark.read.json("/mnt/data/json/")

# By default each line must be a complete JSON object (JSON Lines);
# for pretty-printed or array-wrapped files, enable multiLine:
# df = spark.read.option("multiLine", True).json("/mnt/data/json/")

  • Pros: Simple to implement.
  • Best For: Static datasets or small batch processing.

3. Using SQL

Databricks SQL supports creating tables directly from JSON files:

CREATE TABLE events
USING json
OPTIONS (path "/mnt/data/json/");
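Since the best practices below recommend storing data in Delta, a common follow-up is to materialize the raw JSON into a Delta table with CTAS (table name is illustrative):

```sql
-- Convert the raw JSON into a Delta table for efficient querying
CREATE TABLE events_delta
USING DELTA
AS SELECT * FROM json.`/mnt/data/json/`;
```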

Working with Nested JSON

JSON often contains nested structures. Databricks allows you to:

  • Flatten arrays of nested records using explode together with select.
  • Access nested struct keys using dot notation: df.select("device.type", "device.os")
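The dot-notation access above reflects a general flattening idea; outside Spark it can be sketched in plain Python (function name is my own):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dot-notation keys, like Spark's device.type."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        else:
            flat[name] = value
    return flat

event = {"user": "john_doe", "device": {"type": "mobile", "os": "iOS"}}
print(flatten(event))
# {'user': 'john_doe', 'device.type': 'mobile', 'device.os': 'iOS'}
```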

Schema Evolution and Rescued Data

When a new JSON field appears:

  • Auto Loader's rescued data column can capture unexpected or mistyped fields in _rescued_data instead of dropping them.
  • This ensures no data is lost while you update your schema.
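As a toy illustration of the rescue idea (not Auto Loader's actual implementation), fields outside the expected schema can be diverted into a catch-all column:

```python
import json

EXPECTED_FIELDS = {"user", "action", "timestamp"}

def rescue(record, expected):
    """Keep expected fields as columns; stash anything else in _rescued_data."""
    row = {key: record.get(key) for key in expected}
    extras = {k: v for k, v in record.items() if k not in expected}
    row["_rescued_data"] = json.dumps(extras) if extras else None
    return row

row = rescue({"user": "john_doe", "action": "login", "plan": "pro"}, EXPECTED_FIELDS)
```

Here the unexpected "plan" field survives in _rescued_data, so nothing is lost while the schema is being updated.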

Best Practices

  1. Infer Schema Carefully – Avoid repeated inference on large datasets; define schema explicitly for performance.
  2. Use Delta Format – Store JSON data in Delta Lake for ACID transactions and efficient querying.
  3. Monitor Schema Drift – Track changes in structure over time to avoid breaking pipelines.
  4. Optimize for Queries – Flatten and transform JSON into a query-friendly format in the Silver layer.
