Semi-structured data is everywhere — from web APIs and IoT devices to log files and application events. One of the most common semi-structured formats is JSON (JavaScript Object Notation). It’s human-readable, flexible, and easy to work with — but it can also be tricky to ingest and query efficiently at scale.
In this blog, we’ll explore how to ingest JSON data into Databricks, why it matters, and some best practices for doing it reliably.
What is Semi-Structured Data?
Unlike structured data (rows and columns in a relational database) or completely unstructured data (images, audio, video), semi-structured data contains elements of both:
- It’s organized in a predictable format (e.g., key-value pairs in JSON).
- It’s flexible, allowing different records to have varying fields.
Example JSON:
{
  "user": "john_doe",
  "action": "login",
  "device": {
    "type": "mobile",
    "os": "iOS"
  },
  "timestamp": "2025-08-09T10:15:30Z"
}
Why JSON is Common in Data Engineering
- APIs & Services: Most modern APIs return JSON.
- Logs & Events: Cloud services, microservices, and IoT devices often log in JSON format.
- Flexibility: Schema can evolve without breaking the entire dataset.
Ingesting JSON in Databricks
Databricks provides multiple ways to ingest JSON files stored in cloud storage (Azure Data Lake, S3, GCS, etc.).
1. Using Auto Loader for Incremental Ingestion
If JSON data is continuously landing in cloud storage, Databricks Auto Loader can incrementally load new files:
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Required when inferring schema: where Auto Loader tracks the schema
      # across runs (path is illustrative)
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema/")
      .load("/mnt/data/json/"))
- Pros: Scalable, handles schema evolution automatically.
- Best For: Real-time or near-real-time ingestion.
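Putting the pieces together, here is a hedged sketch of a full incremental pipeline. It only runs on Databricks (cloudFiles is Databricks-specific), and the paths and the bronze_events table name are illustrative assumptions:

```python
# Databricks-only sketch; paths and the table name are illustrative placeholders.
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          # Auto Loader persists the inferred schema here and evolves it over time
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/schema/")
          .load("/mnt/data/json/"))

(stream.writeStream
       # The checkpoint lets the stream resume exactly where it left off
       .option("checkpointLocation", "/mnt/checkpoints/events/")
       .trigger(availableNow=True)  # drain all pending files, then stop
       .toTable("bronze_events"))
```

The availableNow trigger gives batch-like cost control while keeping Auto Loader's exactly-once file tracking, which makes it a common choice for scheduled incremental loads.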
2. Using spark.read.json for Batch Loads
For one-time or periodic batch ingestion:
df = spark.read.json("/mnt/data/json/")
- Pros: Simple to implement.
- Best For: Static datasets or small batch processing.
3. Using SQL
Databricks SQL supports creating tables directly from JSON files:
CREATE TABLE events
USING json
OPTIONS (path "/mnt/data/json/");
Working with Nested JSON
JSON often contains nested structures. Databricks allows you to:
- Flatten nested fields using explode and select.
- Access nested keys using dot notation:
df.select("device.type", "device.os")
Schema Evolution and Rescued Data
When a new JSON field appears:
- Auto Loader with the rescued data column can capture unexpected fields in _rescued_data.
- This ensures no data is lost while you update your schema.
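As a Databricks-only sketch (paths are illustrative, and _rescued_data is the default column name, so setting it explicitly mainly serves as documentation):

```python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/schema/")
      # Fields that don't match the tracked schema are captured here as JSON
      # instead of being dropped
      .option("rescuedDataColumn", "_rescued_data")
      .load("/mnt/data/json/"))
```

Downstream, you can inspect _rescued_data to spot schema drift early and decide which new fields deserve promotion into the real schema.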
Best Practices
- Infer Schema Carefully – Avoid repeated inference on large datasets; define schema explicitly for performance.
- Use Delta Format – Store JSON data in Delta Lake for ACID transactions and efficient querying.
- Monitor Schema Drift – Track changes in structure over time to avoid breaking pipelines.
- Optimize for Queries – Flatten and transform JSON into a query-friendly format in the Silver layer.