Mohammad Gufran Jahangir August 9, 2025

In real-world data ingestion scenarios, it’s common to deal with inconsistent or malformed data. When you’re enforcing a schema in Databricks, these mismatches can cause ingestion failures or loss of data — unless you use a powerful feature called the Rescued Data Column.

This post explains what the rescued data column is, how it works, and why it’s useful.


What is the Rescued Data Column?

The rescued data column is a special column in Databricks (commonly named _rescued_data) that automatically captures any fields from incoming data that do not match the expected schema.

Instead of discarding bad or unexpected data, Databricks stores it as a JSON-formatted string, so you can inspect, clean, and reprocess it later.
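On Databricks, the column can be enabled explicitly when reading JSON or CSV against an enforced schema via the rescuedDataColumn reader option (Auto Loader adds the column by default when it infers a schema). A minimal sketch, assuming a JSON source at a hypothetical path; this needs a Databricks runtime with a `spark` session, so treat it as illustrative rather than runnable as-is:

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Enforced schema for the Bronze table: users STRING, cost BIGINT
schema = StructType([
    StructField("users", StringType()),
    StructField("cost", LongType()),
])

# rescuedDataColumn tells the reader to keep any value that fails the
# schema (type mismatch, unknown column, casing difference) as a JSON
# string instead of silently dropping it.
df = (
    spark.read
    .schema(schema)
    .option("rescuedDataColumn", "_rescued_data")
    .json("/mnt/raw/costs/")  # hypothetical source path
)
```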


How it Works: An Example

Imagine you’re ingesting data from a file into a Bronze table with the following schema:

Column | Data Type
users  | STRING
cost   | BIGINT

Incoming Data

users | cost
peter | $100
zebi  | 300

What Happens During Ingestion

  1. Row 1 (Peter)
    • The value "$100" is not a valid BIGINT.
    • The cost column stores null for this row.
    • The original malformed data is captured in the _rescued_data column as: {"cost": "$100", "_file_path": "<file_path>"}
  2. Row 2 (Zebi)
    • The value 300 is a valid BIGINT.
    • It’s stored directly in the cost column.
    • _rescued_data remains null.
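The per-row logic above can be sketched in plain Python. This is a simulation of the behavior, not Databricks' actual implementation, and the file path passed in is a hypothetical stand-in for the real source:

```python
import json

def ingest_row(row, file_path):
    """Simulate schema enforcement with a rescued data column.

    Expected schema: users STRING, cost BIGINT. Any cost value that
    cannot be cast to BIGINT is nulled out and captured, together with
    the source file path, as a JSON string in _rescued_data.
    """
    rescued = {}
    try:
        cost = int(row["cost"])
    except (ValueError, TypeError):
        cost = None
        rescued["cost"] = row["cost"]

    out = {"users": row["users"], "cost": cost}
    if rescued:
        rescued["_file_path"] = file_path
        out["_rescued_data"] = json.dumps(rescued)
    else:
        out["_rescued_data"] = None
    return out

rows = [{"users": "peter", "cost": "$100"}, {"users": "zebi", "cost": "300"}]
bronze = [ingest_row(r, "dbfs:/raw/costs.json") for r in rows]
# peter: cost is None and _rescued_data holds the original "$100"
# zebi:  cost is 300 and _rescued_data is None
```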

Resulting Bronze Table

users | cost | _rescued_data
peter | null | {"cost": "$100", "_file_path": "<file_path>"}
zebi  | 300  | null

Why This is Useful

  • Prevents Data Loss – You don’t lose records just because a field doesn’t match the schema.
  • Easier Debugging – You can track the original malformed values and the file they came from.
  • Flexible Cleaning – You can later parse _rescued_data to fix and reprocess problematic fields.
  • Perfect for Bronze Layer – This approach preserves raw data for auditing while allowing schema enforcement.

Best Practices

  1. Enable the Rescued Data Column for Bronze layer tables where raw ingestion happens.
  2. Periodically review _rescued_data contents to detect data quality issues early.
  3. Automate cleanup by parsing _rescued_data and applying transformations before moving to Silver/Gold layers.
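Best practice 3 can be sketched in plain Python: parse a row's _rescued_data, clean the raw value, and backfill the cost column. The cleaning rule (strip a leading "$" and thousands separators) is an assumption for this example; in a real pipeline the same idea would run as a Spark transformation before writing to Silver:

```python
import json

def repair_cost(row):
    """Backfill cost from _rescued_data when the raw value was a
    currency-formatted string such as "$100"."""
    if row["cost"] is not None or not row["_rescued_data"]:
        return row  # nothing to repair
    rescued = json.loads(row["_rescued_data"])
    cleaned = rescued.get("cost", "").lstrip("$").replace(",", "")
    if cleaned.isdigit():
        # Repaired: promote the cleaned value, clear the rescue column.
        return dict(row, cost=int(cleaned), _rescued_data=None)
    return row  # leave truly unparseable rows for manual review

bronze_row = {
    "users": "peter",
    "cost": None,
    "_rescued_data": '{"cost": "$100", "_file_path": "<file_path>"}',
}
silver_row = repair_cost(bronze_row)
# silver_row["cost"] == 100, _rescued_data cleared
```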

