
Performance Issues When Querying Shared Tables in Databricks


Introduction

Databricks allows table sharing across workspaces using Unity Catalog and Delta Sharing, enabling multiple teams to collaborate on data. However, querying shared tables can sometimes cause performance bottlenecks, leading to slow query execution, high latency, or unexpected failures.

🚨 Common issues when querying shared tables:

  • Slow query performance compared to local tables.
  • Increased query execution time due to network latency.
  • High compute and memory usage when querying large shared datasets.
  • Frequent query failures due to storage or permission errors.

This guide will help you diagnose and optimize performance when querying shared tables in Databricks.


Understanding Shared Tables in Databricks

There are two main ways to share tables in Databricks:

  1. Unity Catalog Table Sharing (Cross-Workspace Sharing)
    • Tables are shared between Databricks workspaces within the same organization.
    • Uses Unity Catalog’s managed data governance model.
  2. Delta Sharing (External Data Sharing)
    • Shares Delta tables with external organizations over the Delta Sharing protocol.
    • Uses secure access to tables stored in cloud storage.
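Both paths end up queryable from Spark. A minimal sketch of each (all table, share, and schema names plus the profile path are placeholders; the second case uses the open-source delta-sharing client):

import delta_sharing

# Unity Catalog share: query through the three-level catalog.schema.table namespace
df = spark.read.table("shared_catalog.shared_schema.shared_table")

# Delta Sharing open protocol: read with a recipient profile file
df_external = delta_sharing.load_as_spark(
    "/path/to/profile.share#share_name.schema_name.table_name"
)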

💡 Why are shared tables slower than local tables?

  • Queries on shared tables involve additional network hops compared to local tables.
  • Lack of caching due to read-only permissions on shared datasets.
  • Optimizations like indexing, clustering, or materialized views may not be enabled on shared tables.

Common Performance Issues and Fixes

1. Queries on Shared Tables Are Slower Than Local Tables

Symptoms:

  • Queries on shared tables take significantly longer than the same queries on local tables.
  • Slow joins and aggregations involving shared tables.

Causes:

  • Shared tables may reside in a different workspace or region, increasing network latency.
  • Data skipping and indexing are not applied if the table is not optimized.
  • Lack of caching leads to repeated expensive reads from storage.

Fix:
Enable Delta Caching for Faster Reads

spark.conf.set("spark.databricks.io.cache.enabled", "true")
df = spark.read.table("shared_catalog.shared_schema.shared_table").cache()
df.count()  # Trigger caching

Optimize Shared Tables for Faster Query Execution

  • If the table owner allows, run OPTIMIZE on shared Delta tables:
OPTIMIZE shared_catalog.shared_schema.shared_table ZORDER BY (primary_key);
  • Ensure statistics are collected for faster query planning:
ANALYZE TABLE shared_catalog.shared_schema.shared_table COMPUTE STATISTICS;

Replicate Data to Local Storage (If Allowed)

  • If query performance is critical, copy shared data to a local Delta table:
CREATE TABLE local_schema.local_table AS SELECT * FROM shared_catalog.shared_schema.shared_table;
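To keep such a replica fresh, rebuild it on a schedule (for example from a Databricks job). A minimal sketch reusing the placeholder names above; a full rebuild is the simplest approach when the shared table is modest in size:

# Full refresh of the local replica from the shared table
spark.sql("""
    CREATE OR REPLACE TABLE local_schema.local_table
    AS SELECT * FROM shared_catalog.shared_schema.shared_table
""")

# The replica is locally owned, so optimizations denied on the read-only share are allowed here
spark.sql("OPTIMIZE local_schema.local_table ZORDER BY (primary_key)")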

2. High Query Latency Due to External Data Storage

Symptoms:

  • High I/O wait times when querying shared tables.
  • Queries take longer than expected, even for small datasets.

Causes:

  • Tables stored in external cloud storage (S3, ADLS, GCS) introduce latency.
  • Cross-region access to data increases response times.

Fix:
Ensure That Databricks and Cloud Storage Are in the Same Region

  • Check the storage location of the shared table:
DESCRIBE EXTENDED shared_catalog.shared_schema.shared_table;
  • If the table is in another region, consider migrating or replicating the data to a closer storage location.
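The location can also be pulled out programmatically; a minimal sketch that filters the DESCRIBE EXTENDED output down to the storage path (the region is usually evident from the bucket or storage-account name in the path):

# Extract the storage path from the table metadata
detail = spark.sql("DESCRIBE EXTENDED shared_catalog.shared_schema.shared_table")
detail.filter(detail.col_name == "Location").show(truncate=False)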

Use Local Storage or Delta Caching to Reduce I/O Latency

  • Cache frequently accessed tables in memory:
df = spark.read.table("shared_catalog.shared_schema.shared_table").cache()
  • If possible, store replicated datasets in DBFS or workspace-local cloud storage, which sits in the same region as your clusters, for faster access.

3. Slow Joins Between Shared Tables and Local Tables

Symptoms:

  • Queries involving joins with shared tables take longer than expected.
  • Broadcast joins fail due to large dataset sizes.

Causes:

  • Shared tables are not optimized for joins (e.g., lack of partitioning or indexing).
  • Data skew issues cause some workers to handle more data than others.
  • Broadcast joins fail due to exceeding broadcast threshold.

Fix:
Use Broadcast Joins for Small Tables

from pyspark.sql.functions import broadcast

df_small = broadcast(spark.read.table("shared_catalog.shared_schema.small_table"))
df_large = spark.read.table("local_schema.local_table")

df_joined = df_large.join(df_small, "id")
df_joined.show()
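
To confirm the hint took effect, inspect the physical plan, which should show a BroadcastHashJoin rather than a SortMergeJoin:

df_joined.explain()  # Look for BroadcastHashJoin in the physical plan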

Partition Shared Tables for Better Join Performance

  • Delta partitioning cannot be changed with ALTER TABLE, and a recipient cannot repartition a read-only share. Ask the owner to partition the source on frequently used join keys, or partition a local replica yourself:
CREATE TABLE local_schema.partitioned_table PARTITIONED BY (customer_id) AS SELECT * FROM shared_catalog.shared_schema.shared_table;

Raise the Auto-Broadcast Join Threshold

# Default is 10MB; raising it lets slightly larger dimension tables still be broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50MB")

4. Frequent Query Failures or Permission Errors on Shared Tables

Symptoms:

  • Error: “Table not found in Unity Catalog.”
  • Error: “Permission denied when accessing shared table.”
  • Queries randomly fail due to metadata synchronization issues.

Causes:

  • The table owner may have revoked access to the shared table.
  • Unity Catalog permissions are misconfigured for the shared table.
  • Cluster configuration does not support Unity Catalog.

Fix:
Verify That the Shared Table Exists in Unity Catalog

SHOW TABLES IN shared_catalog.shared_schema;

Check Permissions for Shared Table Access

SHOW GRANTS ON TABLE shared_catalog.shared_schema.shared_table;

Request That the Table Owner Re-Share the Table If Permissions Were Revoked

  • The table owner (or an admin) restores access with:
GRANT SELECT ON TABLE shared_catalog.shared_schema.shared_table TO `user@example.com`;

5. Slow Query Execution Due to Small File Problem

Symptoms:

  • Queries on shared tables take longer than expected even for small datasets.
  • Frequent shuffle operations slow down query execution.

Causes:

  • The shared table contains too many small files, causing excessive metadata operations.
  • High shuffle overhead due to small partitions.
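
A small-file problem can be confirmed from the table's Delta metadata. A minimal sketch using DESCRIBE DETAIL (its numFiles and sizeInBytes fields are standard for Delta tables; availability on a share depends on the sharing mode):

# Estimate the average file size behind the table
detail = spark.sql("DESCRIBE DETAIL shared_catalog.shared_schema.shared_table").collect()[0]
avg_mb = detail.sizeInBytes / detail.numFiles / (1024 * 1024)
print(f"{detail.numFiles} files, avg {avg_mb:.1f} MB per file")  # Far below ~128 MB suggests small files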

Fix:
Compact Small Files Using OPTIMIZE

  • If the table owner allows, run OPTIMIZE to merge small files:
OPTIMIZE shared_catalog.shared_schema.shared_table ZORDER BY (timestamp);

Coalesce Data Before Writing Results Derived from Shared Tables

# Reduce the number of output files without triggering a full shuffle
df.coalesce(10).write.format("delta").save("/mnt/delta/optimized_table")

Tune Shuffle Partitions to Match Data Volume

# 200 is the default; lower it for small shared datasets, raise it for very large shuffles
spark.conf.set("spark.sql.shuffle.partitions", "200")

Step-by-Step Troubleshooting Guide

1. Check Query Performance Metrics in Spark UI

  • Go to Databricks UI → Spark UI → SQL Query Details to identify slow operations.

2. Verify If Delta Caching Is Enabled for Shared Tables

spark.conf.get("spark.databricks.io.cache.enabled")

3. Test Storage Latency for Shared Tables

  • Run a simple timing test to see how long it takes to read from shared storage:
import time
start = time.time()
df = spark.read.table("shared_catalog.shared_schema.shared_table")
df.count()
end = time.time()
print(f"Query Time: {end - start} seconds")

4. Check Network Latency to Cloud Storage

  • Cloud storage endpoints often drop ICMP pings, so time an HTTPS connection instead (run from a notebook cell with the %sh magic; the hostname is a placeholder):
%sh curl -s -o /dev/null -w "connect: %{time_connect}s  total: %{time_total}s\n" https://storage-account.blob.core.windows.net

Best Practices for Querying Shared Tables in Databricks

Enable Delta Caching for Faster Reads

spark.conf.set("spark.databricks.io.cache.enabled", "true")

Optimize Table Storage Using OPTIMIZE and ZORDER

OPTIMIZE shared_catalog.shared_schema.shared_table ZORDER BY (primary_key);

Use Broadcast Joins When Possible

from pyspark.sql.functions import broadcast
df_small = broadcast(spark.read.table("shared_catalog.shared_schema.small_table"))

Monitor Query Performance Using Databricks Spark UI

  • Identify slow queries and optimize accordingly.

Conclusion

Querying shared tables in Databricks can lead to performance bottlenecks due to network latency, lack of caching, data partitioning issues, and permission constraints. By following these optimization techniques, teams can significantly improve query speed and efficiency on shared datasets.
