Introduction
Querying Unity Catalog tables in Databricks should be fast and efficient, but sometimes performance issues arise, such as:
🚨 Common issues:
- Queries take significantly longer than expected.
- High execution times for SELECT, JOIN, or aggregate operations.
- Slow metadata retrieval when listing tables, catalogs, or schemas.
- Performance drops compared to Hive Metastore or local Delta tables.
This guide explains why Unity Catalog queries may be slow and how to optimize performance for better speed and efficiency.
1. Verify If Unity Catalog Tables Are Using Delta Format
Symptoms:
- Queries run slowly compared to standard Delta tables.
- Metadata operations (SHOW TABLES, SHOW COLUMNS) take time.
Causes:
- Unity Catalog managed tables must use the Delta format; a table migrated from a Hive Metastore may still be in Parquet or may be missing Delta optimizations (statistics, compaction), which degrades performance.
Fix:
✅ Check the table format:
DESCRIBE DETAIL my_catalog.my_schema.my_table;
✅ If the table is stored as Parquet rather than Delta, convert it:
CONVERT TO DELTA my_catalog.my_schema.my_table;
✅ Use OPTIMIZE to speed up queries after conversion:
OPTIMIZE my_catalog.my_schema.my_table ZORDER BY (primary_column);
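ZORDER BY co-locates rows with nearby key values in the same files by mapping keys onto a space-filling (Z-order) curve, so data skipping can prune more files. A minimal Python sketch of the underlying bit-interleaving idea (illustrative only, not Databricks' actual implementation):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a Z-order (Morton) key.

    Points that are close in (x, y) space tend to get close keys,
    so sorting rows by this key clusters related values together.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions from x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions from y
    return key

# Rows sorted by the Z-order key end up grouped by locality in both columns,
# which is what lets min/max file statistics skip unrelated files.
rows = [(3, 5), (0, 0), (1, 1), (2, 2)]
rows.sort(key=lambda r: z_order_key(*r))
```

Sorting by this interleaved key is what lets a filter on either column skip files whose min/max ranges don't overlap the predicate.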
2. Query Performance Is Slower Than Expected Due to Metadata Overhead
Symptoms:
- SHOW CATALOGS, SHOW TABLES, and DESCRIBE commands are slow.
- Queries using INFORMATION_SCHEMA take longer than expected.
Causes:
- Unity Catalog stores metadata centrally, meaning queries first interact with the metastore before execution.
- Large numbers of schemas or tables in Unity Catalog increase lookup times.
Fix:
✅ Use direct SQL queries instead of listing tables for faster access:
SELECT * FROM my_catalog.my_schema.my_table WHERE id = 100;
β
Avoid using INFORMATION_SCHEMA.TABLES
when possible.
SELECT table_name FROM my_catalog.my_schema.tables WHERE table_name LIKE 'sales%';
✅ Enable query caching to reduce metadata lookup times.
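The effect of caching repeated metadata lookups can be sketched with simple memoization; `fetch_table_names` below is a hypothetical stand-in for a metastore round-trip, not a Databricks API:

```python
from functools import lru_cache

CALLS = 0  # counts how many real metastore round-trips happen

@lru_cache(maxsize=None)
def fetch_table_names(schema: str) -> tuple:
    """Hypothetical metastore lookup; the cache makes repeats free."""
    global CALLS
    CALLS += 1
    return ("orders", "customers")  # stand-in for the metastore response

fetch_table_names("my_schema")
fetch_table_names("my_schema")  # served from cache, no second round-trip
```

The same principle applies when the platform caches catalog metadata: only the first listing pays the central-metastore cost.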
3. Query Execution Is Slow Due to Lack of Z-Ordering or File Compaction
Symptoms:
- Queries with WHERE, JOIN, or ORDER BY are slow.
- Partitions are read inefficiently, leading to longer execution times.
- High file read times for large tables in Unity Catalog.
Causes:
- Too many small files cause inefficient table scans.
- Z-ordering is not enabled, leading to unoptimized data retrieval.
Fix:
✅ Run OPTIMIZE with ZORDER BY to improve query performance:
OPTIMIZE my_catalog.my_schema.my_table ZORDER BY (customer_id);
✅ For partitioned tables, ensure partitions are evenly distributed:
DESCRIBE DETAIL my_catalog.my_schema.my_table;
✅ If there are many small files, enable auto compaction:
ALTER TABLE my_catalog.my_schema.my_table SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true');
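Why compaction helps can be seen with a rough back-of-envelope estimate of how many files a full scan must open before and after compaction (128 MB is an assumed target file size for illustration, not a Databricks constant):

```python
def compacted_file_count(file_sizes_mb, target_mb=128):
    """Estimate files remaining after compacting small files into
    files of roughly target_mb each (ceiling division on total size)."""
    total = sum(file_sizes_mb)
    return max(1, -(-total // target_mb))  # ceiling division

# 1000 tiny 1 MB files collapse into roughly 8 files of ~128 MB,
# cutting per-file open/close and listing overhead by two orders of magnitude.
sizes = [1] * 1000
before, after = len(sizes), compacted_file_count(sizes)
```

Each file a scan touches carries fixed overhead (listing, opening, footer reads), so shrinking the file count directly shrinks scan time.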
4. Queries Are Slow Due to Suboptimal Cluster Configuration
Symptoms:
- Queries run faster on smaller datasets but slow down on larger ones.
- High CPU or memory usage on worker nodes.
- Queries crash due to out-of-memory errors.
Causes:
- Insufficient cluster resources (CPU, memory) for query execution.
- Auto-scaling not configured, leaving clusters either overloaded at peak or underutilized the rest of the time.
Fix:
✅ Increase cluster size for large queries:
- Databricks UI → Clusters → Edit Cluster
- Set worker nodes to a higher memory and CPU configuration.
✅ Enable auto-scaling for better performance:
- Auto-scaling is a cluster setting (Edit Cluster → Enable autoscaling, with min/max workers), not a Spark property.
✅ Enable the Databricks I/O cache in the cluster's Spark config:
{
"spark.databricks.io.cache.enabled": "true"
}
✅ Monitor cluster performance in the Spark UI to identify bottlenecks.
5. Query Performance Degrades Due to Inefficient Joins
Symptoms:
- JOIN operations take a long time to execute.
- Out-of-memory errors occur when joining large tables.
- Query plans show full table scans instead of efficient lookups.
Causes:
- Broadcast joins are not being used, leading to slow shuffle operations.
- Join keys are not clustered or Z-ordered, forcing full scans on both sides.
Fix:
✅ Make sure broadcast joins are enabled for small tables by raising the broadcast threshold (note: setting it to -1 disables broadcasting entirely):
SET spark.sql.autoBroadcastJoinThreshold = 104857600; -- 100 MB
✅ Check if broadcast joins are being applied:
EXPLAIN FORMATTED SELECT * FROM large_table JOIN small_table ON large_table.id = small_table.id;
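Conceptually, a broadcast join ships the small table to every executor and builds a hash table there, so the large side streams through without a shuffle. A pure-Python sketch of that build-and-probe pattern (illustrative, not Spark's implementation):

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Build a hash map over the small side, then probe it while
    streaming the large side -- no shuffle of the large table needed."""
    built = {}
    for row in small_rows:                  # build phase (small table)
        built.setdefault(row[key], []).append(row)
    joined = []
    for row in large_rows:                  # probe phase (large table)
        for match in built.get(row[key], []):
            joined.append({**row, **match})
    return joined

orders = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
names = [{"id": 1, "name": "Ada"}]
result = broadcast_hash_join(orders, names, "id")  # inner-join semantics
```

The build side must fit in each executor's memory, which is why the broadcast threshold should only be raised for genuinely small tables.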
✅ If possible, pre-partition the data to optimize joins:
CREATE TABLE my_catalog.my_schema.optimized_table
USING DELTA
PARTITIONED BY (region)
AS SELECT * FROM my_catalog.my_schema.my_table;
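Partitioning pays off because a filter on the partition column lets the engine skip whole directories. A minimal sketch of that pruning step, against a hypothetical file listing of the layout PARTITIONED BY (region) produces:

```python
def prune_partitions(files, region):
    """Keep only data files whose partition directory matches the filter.

    `files` is a hypothetical listing of Delta data files laid out as
    region=<value>/part-*.parquet, as PARTITIONED BY (region) produces.
    """
    return [f for f in files if f"region={region}/" in f]

files = [
    "region=US/part-000.parquet",
    "region=US/part-001.parquet",
    "region=EU/part-000.parquet",
]
# A query with WHERE region = 'EU' only has to read one of the three files.
scanned = prune_partitions(files, "EU")
```

This is also why the partition column should appear in join or filter predicates: without it, every partition is scanned and the layout buys nothing.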
6. Query Performance Degrades Due to External Storage Latency
Symptoms:
- Queries run slower on Unity Catalog tables than on local Delta tables.
- Increased read latency from cloud storage (AWS S3, Azure ADLS, GCS).
Causes:
- Unity Catalog uses external cloud storage, which may have higher latency than local Delta tables.
- Cross-region data queries introduce network latency.
Fix:
✅ Ensure Databricks and storage are in the same region to minimize latency.
✅ Enable Databricks I/O caching to improve performance:
{
"spark.databricks.io.cache.enabled": "true"
}
✅ Use partitioned and optimized Delta tables to reduce I/O operations.
7. Check Query Execution Plan for Bottlenecks
Symptoms:
- Queries are slower than expected even after optimizations.
- High shuffle or scan times when checking Spark UI logs.
Fix:
✅ Check the query execution plan to identify slow steps:
EXPLAIN FORMATTED SELECT * FROM my_catalog.my_schema.my_table WHERE column = 'value';
✅ Look for expensive operations (e.g., full table scans, high shuffle times).
✅ Optimize the query structure based on the execution plan results.
Step-by-Step Troubleshooting Guide
Step 1: Check Table Format and Optimization Status
DESCRIBE DETAIL my_catalog.my_schema.my_table;
- If the table is not Delta, convert it to Delta format.
Step 2: Optimize Data Storage
OPTIMIZE my_catalog.my_schema.my_table ZORDER BY (primary_column);
- Reduces small file issues and improves indexing.
Step 3: Check Cluster Performance
- Increase worker node size or enable auto-scaling.
- Monitor Spark UI for high shuffle times.
Step 4: Enable Query Caching and Reduce Metadata Overhead
SET spark.databricks.io.cache.enabled = true;
- Caches remote data on local disks, speeding up repeated table scans.
Step 5: Use Efficient Joins and Indexing
- Enable broadcast joins for small tables.
- Pre-partition large tables for faster lookups.
Best Practices for Fast Unity Catalog Queries
✅ Use Delta Format and Optimize Regularly
OPTIMIZE my_catalog.my_schema.my_table ZORDER BY (customer_id);
✅ Ensure Databricks and Storage Are in the Same Region
- Reduces network latency issues.
✅ Enable Databricks I/O Caching
SET spark.databricks.io.cache.enabled = true;
✅ Monitor Query Performance Using Execution Plans
EXPLAIN FORMATTED SELECT * FROM my_catalog.my_schema.my_table;
Conclusion
If Unity Catalog queries are slow, check for:
✅ Delta table optimizations (OPTIMIZE, ZORDER).
✅ Cluster configuration and auto-scaling.
✅ Storage latency issues (use the same region, enable caching).
✅ Query execution plan bottlenecks.
By following these best practices, Unity Catalog queries will run faster and more efficiently in Databricks.