
Schema Changes Are Not Reflected in Shared Datasets in Databricks

Introduction

When working with shared datasets in Databricks (Unity Catalog, Delta Sharing, or external tables), schema changes may not appear in downstream consumers. This can cause query failures, missing columns, or outdated metadata.

🚨 Common issues caused by schema changes not reflecting in shared datasets:

  • New columns do not appear in queries from shared consumers.
  • Schema updates take too long to propagate.
  • Queries return outdated results after schema modification.
  • Schema evolution settings are ignored in Delta tables.

This guide explores causes, troubleshooting steps, and best practices to ensure schema changes reflect properly in shared datasets.


1. Verify If the Schema Change Was Committed to the Source Table

Symptoms:

  • New columns or data types are not appearing in queries.
  • Schema updates appear in the source table but not in shared datasets.

Causes:

  • Schema updates were not committed properly.
  • Changes were made in a separate session and not reflected in downstream queries.
  • Databricks caching mechanisms may be serving stale metadata.

Fix:

Verify schema changes in the source table:

DESCRIBE TABLE my_table;

Ensure the schema change was applied correctly:

ALTER TABLE my_table ADD COLUMN new_column STRING;

Refresh table metadata in downstream consumers:

REFRESH TABLE my_table;

If downstream writes should pick up new columns automatically rather than failing, enable schema auto-merge for the session (note: there is no table property named delta.feature.allowColumnSchemaEvolution; schema evolution is controlled on the write side):

SET spark.databricks.delta.schema.autoMerge.enabled = true;
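
The checks above boil down to comparing the source table's columns with what a consumer actually sees. A minimal drift check in plain Python (the column lists are hypothetical stand-ins for the output of DESCRIBE TABLE on each side):

```python
def schema_drift(source_columns, consumer_columns):
    """Compare two (name, type) column lists and report columns the
    consumer is missing or seeing with a stale type."""
    source = dict(source_columns)
    consumer = dict(consumer_columns)
    missing = [name for name in source if name not in consumer]
    type_mismatch = [
        name for name in source
        if name in consumer and consumer[name] != source[name]
    ]
    return {"missing": missing, "type_mismatch": type_mismatch}

# Hypothetical DESCRIBE TABLE output from the provider and consumer sides.
source = [("id", "bigint"), ("name", "string"), ("new_column", "string")]
consumer = [("id", "bigint"), ("name", "string")]

print(schema_drift(source, consumer))
# {'missing': ['new_column'], 'type_mismatch': []}
```

If the report is empty on both counts, the problem is more likely caching than a missed commit.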

2. Unity Catalog Tables Not Reflecting Schema Changes

Symptoms:

  • Schema updates in Unity Catalog are not visible in shared workspaces.
  • Consumers see an outdated schema when querying shared tables.

Causes:

  • Unity Catalog caches table schemas and does not update them immediately.
  • Consumers are querying an outdated version of the table.
  • Schema evolution settings are not enabled.

Fix:

Ensure that Unity Catalog propagates schema updates:

ALTER TABLE my_catalog.my_schema.my_table ADD COLUMN new_column STRING;

If the schema change involves renaming or dropping columns, enable column mapping on the table (the autoOptimize properties sometimes suggested here control write optimization, not schema handling):

ALTER TABLE my_catalog.my_schema.my_table SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5');

Manually refresh the shared dataset for consumers:

REFRESH TABLE my_catalog.my_schema.my_table;

If using Delta Sharing, ensure the recipient re-queries the updated metadata.


3. Delta Sharing Consumers Do Not See Schema Updates

Symptoms:

  • Schema changes are visible in the provider’s Databricks workspace but not in the shared recipient’s query results.
  • New columns added to a Delta table do not appear in the recipient’s queries.

Causes:

  • Delta Sharing snapshots are not automatically refreshed.
  • Recipients need to re-sync the shared dataset.
  • Schema evolution settings are not enabled for the shared table.

Fix:

Verify that the table is still included in the share (Delta Sharing serves the provider's latest committed table version, so there is no separate provider-side refresh command such as ALTER SHARE ... REFRESH):

SHOW ALL IN SHARE my_share;

Ensure writers that add columns to the shared table use schema evolution rather than failing:

SET spark.databricks.delta.schema.autoMerge.enabled = true;

Ask recipients to manually refresh their view of the shared table (shared tables are addressed as catalog.schema.table on the recipient side):

REFRESH TABLE shared_catalog.shared_schema.shared_table;

4. Cloud Storage External Tables Not Reflecting Schema Updates

Symptoms:

  • Schema changes made in external tables (AWS S3, Azure ADLS, Google Cloud Storage) do not appear in queries.
  • Partitions are missing or outdated.

Causes:

  • Schema changes were not applied to the external table definition.
  • Partition metadata is stale and needs to be refreshed.
  • Query engines cache metadata and do not automatically detect schema changes.

Fix:

Refresh external table metadata:

MSCK REPAIR TABLE my_external_table;

Use REFRESH TABLE to reload schema metadata:

REFRESH TABLE my_external_table;

If using Delta tables on external storage, write with mergeSchema so new columns are merged into the table schema:

df.write.format("delta").mode("append").option("mergeSchema", "true").save("s3://my-bucket/path")
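
Conceptually, MSCK REPAIR TABLE scans the storage path for Hive-style partition directories and registers any it finds in the metastore. A rough sketch of that discovery step in plain Python (the paths are hypothetical object-store keys):

```python
def discover_partitions(paths):
    """Extract Hive-style partition values (key=value path segments)
    from a list of storage paths, deduplicated and sorted."""
    partitions = set()
    for path in paths:
        segments = tuple(s for s in path.split("/") if "=" in s)
        if segments:
            partitions.add(segments)
    return sorted(partitions)

# Hypothetical file listing under the external table's root path.
paths = [
    "year=2024/month=01/part-000.parquet",
    "year=2024/month=02/part-000.parquet",
    "year=2024/month=02/part-001.parquet",
]
print(discover_partitions(paths))
# [('year=2024', 'month=01'), ('year=2024', 'month=02')]
```

This is why stale partitions show up after files land in storage without a corresponding metastore update: the files exist, but nothing has re-run the discovery step.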

5. Schema Evolution Not Working in Delta Tables

Symptoms:

  • Schema changes do not automatically apply when appending new data.
  • Queries fail due to schema mismatch.

Causes:

  • Schema evolution is not enabled in Delta Lake.
  • Merge schema options were not applied when writing data.

Fix:

Enable schema evolution when writing new data:

df.write.format("delta").mode("append").option("mergeSchema", "true").save("s3://my-bucket/path")

To support column renames and drops (not just additions), enable column mapping on the table:

ALTER TABLE my_table SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5');

Use Auto Merge when performing upserts:

SET spark.databricks.delta.schema.autoMerge.enabled = true;
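
Conceptually, mergeSchema takes the union of the existing and incoming columns; old rows read back as null for any newly added column. A plain-Python sketch of that union step (the column lists are hypothetical):

```python
def merge_schemas(existing, incoming):
    """Union two (name, type) column lists, keeping the existing order and
    appending genuinely new columns at the end -- roughly what Delta's
    mergeSchema option does for added columns."""
    merged = list(existing)
    existing_names = {name for name, _ in existing}
    for name, dtype in incoming:
        if name not in existing_names:
            merged.append((name, dtype))
    return merged

existing = [("id", "bigint"), ("name", "string")]
incoming = [("id", "bigint"), ("name", "string"), ("new_column", "string")]
print(merge_schemas(existing, incoming))
# [('id', 'bigint'), ('name', 'string'), ('new_column', 'string')]
```

Without mergeSchema (or the autoMerge session setting), an incoming column that is not in the existing schema causes the write to fail instead of being appended.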

6. Cached Metadata Causing Schema Update Delays

Symptoms:

  • Queries return old schema versions after making schema changes.
  • Column additions are not visible immediately.

Causes:

  • Databricks caches table metadata and does not refresh it automatically.
  • Long-running clusters may be using stale schema definitions.

Fix:

Manually refresh table metadata:

REFRESH TABLE my_table;

Clear Databricks cache if necessary:

spark.catalog.clearCache()

For Delta tables, confirm the schema change actually committed by checking the table history:

DESCRIBE HISTORY my_table;
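
The reason REFRESH TABLE helps is that the cluster holds a cached copy of the table's metadata and only discards it on an explicit refresh. A toy cache in plain Python makes the failure mode concrete (table names and schemas are hypothetical):

```python
class MetadataCache:
    """Toy schema cache: serves stale entries until explicitly refreshed,
    mimicking how a long-running cluster can keep an old schema."""

    def __init__(self, catalog):
        self.catalog = catalog  # the "real" source of truth
        self.cache = {}

    def get_schema(self, table):
        if table not in self.cache:
            self.cache[table] = list(self.catalog[table])
        return self.cache[table]

    def refresh(self, table):  # analogous to REFRESH TABLE
        self.cache.pop(table, None)

catalog = {"my_table": ["id", "name"]}
cache = MetadataCache(catalog)
cache.get_schema("my_table")                # warms the cache
catalog["my_table"].append("new_column")    # schema change at the source
print(cache.get_schema("my_table"))         # ['id', 'name'] -- stale!
cache.refresh("my_table")
print(cache.get_schema("my_table"))         # ['id', 'name', 'new_column']
```

The same logic explains why restarting a cluster also "fixes" the problem: it throws away every cached entry at once.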

Step-by-Step Troubleshooting Guide

Step 1: Verify Schema Changes at the Source

DESCRIBE TABLE my_table;

If the schema change is not listed, retry the schema update.

Step 2: Refresh Table Metadata

REFRESH TABLE my_table;

Step 3: Ensure Schema Evolution Is Enabled for Delta Tables

SET spark.databricks.delta.schema.autoMerge.enabled = true;

Step 4: If Using Unity Catalog or Delta Sharing, Refresh the Shared Table

REFRESH TABLE my_catalog.my_schema.my_table;

Step 5: If Using External Storage, Repair the Table

MSCK REPAIR TABLE my_external_table;

Step 6: Clear Cache if Queries Still Show Old Schema

spark.catalog.clearCache()
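
The six steps can be strung together as an ordered runbook. The sketch below simply generates the commands above in order for a given table (the names are placeholders, and whether the share and external-storage steps apply depends on your setup):

```python
def troubleshooting_runbook(table, share=None, external=False):
    """Emit the troubleshooting commands from the steps above, in order."""
    steps = [
        f"DESCRIBE TABLE {table};",   # 1. verify schema at the source
        f"REFRESH TABLE {table};",    # 2. refresh table metadata
        "SET spark.databricks.delta.schema.autoMerge.enabled = true;",  # 3.
    ]
    if share:
        steps.append(f"SHOW ALL IN SHARE {share};")   # 4. check the share
    if external:
        steps.append(f"MSCK REPAIR TABLE {table};")   # 5. repair partitions
    steps.append("spark.catalog.clearCache()")        # 6. last resort
    return steps

for cmd in troubleshooting_runbook("my_table", share="my_share", external=True):
    print(cmd)
```

Running the list top to bottom, and re-querying after each step, isolates which layer (source, share, or cache) is holding the stale schema.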

Best Practices for Schema Management in Shared Datasets

Use Schema Evolution for Delta Tables

SET spark.databricks.delta.schema.autoMerge.enabled = true;

Manually Refresh Shared Datasets to Sync Changes

REFRESH TABLE my_shared_table;

Avoid Over-Reliance on Cached Metadata

  • Periodically clear cache in long-running Databricks clusters.

Enable Auto Schema Evolution for Streaming and Batch Data Writes

df.write.format("delta").mode("append").option("mergeSchema", "true").save("s3://my-bucket/path")

Conclusion

If schema changes are not reflected in shared datasets, ensure that:

  • Schema updates were successfully committed to the source table.
  • Shared datasets (Unity Catalog, Delta Sharing) have been refreshed.
  • Schema evolution is enabled for writes to Delta tables and external tables.
  • Consumers refresh their queries and metadata manually if needed.

By following this guide, you can ensure schema changes are reflected in shared datasets across Databricks environments.
