Mohammad Gufran Jahangir August 9, 2025

In real-world data ingestion scenarios, it’s common to deal with inconsistent or malformed data. When you’re enforcing a schema in Databricks, these mismatches can cause ingestion failures or loss of data — unless you use a powerful feature called the Rescued Data Column.

This post explains what the rescued data column is, how it works, and why it’s useful.


What is the Rescued Data Column?

The rescued data column is a special column in Databricks (commonly named _rescued_data) that automatically captures any fields from incoming data that do not match the expected schema.

Instead of discarding bad or unexpected data, Databricks stores it as a JSON-formatted string, so you can inspect, clean, and reprocess it later.
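On Databricks, the column can be enabled explicitly when reading JSON or CSV against an enforced schema via the rescuedDataColumn reader option (Auto Loader adds the column by default when it infers a schema). A minimal sketch, assuming a JSON source at a hypothetical path; this needs a Databricks runtime with a `spark` session, so treat it as illustrative rather than runnable as-is:

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Enforced schema for the Bronze table: users STRING, cost BIGINT
schema = StructType([
    StructField("users", StringType()),
    StructField("cost", LongType()),
])

# rescuedDataColumn tells the reader to keep any value that fails the
# schema (type mismatch, unknown column, casing difference) as a JSON
# string instead of silently dropping it.
df = (
    spark.read
    .schema(schema)
    .option("rescuedDataColumn", "_rescued_data")
    .json("/mnt/raw/costs/")  # hypothetical source path
)
```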


How it Works: An Example

Imagine you’re ingesting data from a file into a Bronze table with the following schema:

Column | Data Type
users  | STRING
cost   | BIGINT

Incoming Data

users | cost
peter | $100
zebi  | 300

What Happens During Ingestion

  1. Row 1 (Peter)
    • The value "$100" is not a valid BIGINT.
    • The cost column stores null for this row.
    • The original malformed data is captured in the _rescued_data column as: {"cost": "$100", "_file_path": "<file_path>"}
  2. Row 2 (Zebi)
    • The value 300 is a valid BIGINT.
    • It’s stored directly in the cost column.
    • _rescued_data remains null.
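The per-row logic above can be sketched in plain Python. This is a simulation of the behavior, not Databricks' actual implementation, and the file path passed in is a hypothetical stand-in for the real source:

```python
import json

def ingest_row(row, file_path):
    """Simulate schema enforcement with a rescued data column.

    Expected schema: users STRING, cost BIGINT. Any cost value that
    cannot be cast to BIGINT is nulled out and captured, together with
    the source file path, as a JSON string in _rescued_data.
    """
    rescued = {}
    try:
        cost = int(row["cost"])
    except (ValueError, TypeError):
        cost = None
        rescued["cost"] = row["cost"]

    out = {"users": row["users"], "cost": cost}
    if rescued:
        rescued["_file_path"] = file_path
        out["_rescued_data"] = json.dumps(rescued)
    else:
        out["_rescued_data"] = None
    return out

rows = [{"users": "peter", "cost": "$100"}, {"users": "zebi", "cost": "300"}]
bronze = [ingest_row(r, "dbfs:/raw/costs.json") for r in rows]
# peter: cost is None and _rescued_data holds the original "$100"
# zebi:  cost is 300 and _rescued_data is None
```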

Resulting Bronze Table

users | cost | _rescued_data
peter | null | {"cost": "$100", "_file_path": "<file_path>"}
zebi  | 300  | null

Why This is Useful

  • Prevents Data Loss – You don’t lose records just because a field doesn’t match the schema.
  • Easier Debugging – You can track the original malformed values and the file they came from.
  • Flexible Cleaning – You can later parse _rescued_data to fix and reprocess problematic fields.
  • Perfect for Bronze Layer – This approach preserves raw data for auditing while allowing schema enforcement.

Best Practices

  1. Enable the Rescued Data Column for Bronze layer tables where raw ingestion happens.
  2. Periodically review _rescued_data contents to detect data quality issues early.
  3. Automate cleanup by parsing _rescued_data and applying transformations before moving to Silver/Gold layers.
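Best practice 3 can be sketched in plain Python: parse a row's _rescued_data, clean the raw value, and backfill the cost column. The cleaning rule (strip a leading "$" and thousands separators) is an assumption for this example; in a real pipeline the same idea would run as a Spark transformation before writing to Silver:

```python
import json

def repair_cost(row):
    """Backfill cost from _rescued_data when the raw value was a
    currency-formatted string such as "$100"."""
    if row["cost"] is not None or not row["_rescued_data"]:
        return row  # nothing to repair
    rescued = json.loads(row["_rescued_data"])
    cleaned = rescued.get("cost", "").lstrip("$").replace(",", "")
    if cleaned.isdigit():
        # Repaired: promote the cleaned value, clear the rescue column.
        return dict(row, cost=int(cleaned), _rescued_data=None)
    return row  # leave truly unparseable rows for manual review

bronze_row = {
    "users": "peter",
    "cost": None,
    "_rescued_data": '{"cost": "$100", "_file_path": "<file_path>"}',
}
silver_row = repair_cost(bronze_row)
# silver_row["cost"] == 100, _rescued_data cleared
```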

