Introduction
Slow query performance in Databricks can turn even simple data tasks into frustrating bottlenecks. Whether you’re running ad-hoc analytics, training ML models, or processing streaming data, inefficient queries waste time, inflate costs, and delay insights. In this guide, we’ll explore the root causes of slow queries, provide actionable fixes, and share best practices to optimize performance in your Databricks workflows.
Why Do Queries Slow Down in Databricks?
Databricks leverages Apache Spark under the hood, and slow queries often stem from Spark’s execution model, data architecture, or resource allocation. Common symptoms include:
- Jobs taking hours instead of minutes.
- High executor CPU/memory usage with little progress.
- Tasks stuck on a single stage (e.g., “Stage 3/4 runs indefinitely”).
Top Causes of Slow Query Performance (and How to Fix Them)
1. Data Skew
What Happens: Uneven data distribution across partitions causes a few tasks to process massive datasets while others sit idle.
Example: A GROUP BY on a column with skewed values (e.g., country, where 90% of records are from the USA).
Fix:
- Use salting to redistribute skewed keys:
-- Aggregate on the salted key first, then re-aggregate on the original key
SELECT SPLIT(salted_key, '_')[0] AS country, SUM(cnt) AS total
FROM (SELECT salted_key, COUNT(*) AS cnt
      FROM (SELECT CONCAT(country, '_', FLOOR(RAND() * 10)) AS salted_key FROM table)
      GROUP BY salted_key)
GROUP BY country;
- Enable Adaptive Query Execution (AQE) in Spark 3.0+ to auto-optimize skewed joins.
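To see why salting helps, here is a minimal plain-Python sketch (no Spark required; the 90/10 country split and the bucket count of 10 are illustrative assumptions) of how a hot key spreads across salt buckets:

```python
import random
from collections import Counter

random.seed(42)  # deterministic run for the example

# 90% of rows share one hot key, mirroring the skewed GROUP BY above
rows = ["USA"] * 900 + ["CAN"] * 50 + ["MEX"] * 50

# Salting: append a random suffix 0-9, splitting the hot key into ~10 buckets
salted = Counter(f"{country}_{random.randrange(10)}" for country in rows)

# The 900-row "USA" key is now spread over up to 10 salted keys,
# so no single task has to process all of it
assert max(salted.values()) < 900
```

In Spark, the same idea fans the hot key's rows out across many tasks; a second aggregation then recombines the per-salt partial counts.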
2. Small File Problem
What Happens: Thousands of tiny files (e.g., <1MB Parquet/Delta files) overwhelm Spark’s metadata handling, slowing reads.
Example: Streaming jobs writing frequent small batches to Delta Lake.
Fix:
- Compact files using OPTIMIZE in Delta Lake:
OPTIMIZE delta.`/path/to/table`
ZORDER BY (timestamp);
- Configure streaming jobs to write larger batches (e.g., trigger(processingTime='5 minutes')).
3. Inefficient Query Logic
What Happens: Poorly written SQL or DataFrame operations force unnecessary shuffles or full scans.
Examples:
- SELECT * FROM table without filters.
- Cartesian joins (e.g., CROSS JOIN without constraints).
Fix:
- Avoid SELECT *; fetch only the required columns.
- Use predicate pushdown with partition filters:
SELECT col1, col2 FROM table WHERE date = '2024-01-01';
- Replace CROSS JOIN with broadcast joins for small tables.
4. Suboptimal Spark Configurations
What Happens: Default Spark settings may not match your workload, leading to underutilized resources or OOM errors.
Key Tuning Parameters:
- spark.sql.shuffle.partitions: Set to 2–3x the number of cores (the default of 200 is often too high).
- spark.sql.adaptive.enabled: Set to true for AQE (auto-optimizes joins and partitions).
- spark.executor.memory: Avoid over-allocating memory (e.g., leave ~10% overhead for the OS).
Example Cluster Config:
spark.conf.set("spark.sql.shuffle.partitions", 64)
spark.conf.set("spark.sql.adaptive.enabled", "true")
5. Lack of Caching
What Happens: Repeatedly reading the same data from cloud storage (S3, ADLS) adds latency.
Fix:
- Cache frequently used DataFrames in memory:
df = spark.read.parquet("s3://path/to/data")
df.cache()
- Use Databricks disk caching (the Delta cache) to speed up repeated reads from cloud storage.
6. Resource Bottlenecks
What Happens: Clusters are under-provisioned (too few executors) or over-provisioned (wasting resources).
Fix:
- Use autoscaling to dynamically adjust workers.
- Right-size clusters:
- Memory-heavy jobs (large joins and aggregations): use memory-optimized instances.
- Small jobs: single-node clusters with Photon acceleration.
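As a sketch, autoscaling is set in the cluster spec; the snippet below uses the autoscale fields from the Databricks Clusters API, with the cluster name, runtime version, node type, and worker counts as illustrative assumptions:

```json
{
  "cluster_name": "etl-autoscaling",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 2, "max_workers": 8 }
}
```

Databricks then adds or removes workers between the min and max based on load, so you pay for peak capacity only when the job actually needs it.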
Troubleshooting Slow Queries: Step-by-Step
1. Check the Spark UI
- Open the Spark UI from the Databricks job/cluster dashboard.
- Look for:
- Long-running tasks (indicating skew).
- Large shuffle sizes (e.g., 100GB+ spilled to disk).
- Stages with few tasks (e.g., 1–2 tasks).
2. Analyze the Query Plan
- Run EXPLAIN FORMATTED on your SQL query to see the execution plan:
EXPLAIN FORMATTED
SELECT * FROM table WHERE column = 'value';
- Look for Scan (full table scans) or SortMergeJoin (inefficient joins).
Best Practices for Optimized Performance
1. Partition Data Strategically:
- Partition Delta tables by low-cardinality columns such as date; avoid high-cardinality keys like user IDs.
- Avoid over-partitioning (e.g., 1,000+ partitions).
2. Use Delta Lake Features:
- Z-Ordering: Co-locate related data with ZORDER BY (e.g., user_id + timestamp).
- VACUUM: Clean up stale files (but respect retention periods).
3. Leverage Broadcast Joins:
- For small tables (<10MB), force a broadcast with a hint:
SELECT /*+ BROADCAST(small_table) */ *
FROM large_table
JOIN small_table ON large_table.id = small_table.id;
4. Monitor and Tune Regularly:
- Use the Databricks query history and monitoring dashboards to track performance trends.
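As a sketch of the partitioning advice above, a Delta table partitioned by a low-cardinality date column (the table and column names are illustrative):

```sql
CREATE TABLE events (
  user_id STRING,
  event_date DATE,
  payload STRING
)
USING DELTA
PARTITIONED BY (event_date);
```

Queries that filter on event_date then skip entire partitions instead of scanning the whole table.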
Real-World Example: Fixing a Skewed Join
Scenario: A JOIN on user_id took 4+ hours due to skewed user activity.
Steps Taken:
1. Identified skew using the Spark UI (one task took 3 hours; the others took 5 minutes).
2. Applied salting to redistribute keys:
from pyspark.sql.functions import concat, lit, rand, explode, array, col
# Salt the skewed (large) side with a random suffix 0-9
salted_large_df = large_df.withColumn("salted_id", concat(col("user_id"), lit("_"), (rand() * 10).cast("int")))
# Replicate the small side across all 10 salt values so every salted key finds its match
salted_small_df = small_df.withColumn("salted_id", concat(col("user_id"), lit("_"), explode(array(*[lit(i) for i in range(10)]))))
joined_df = salted_large_df.join(salted_small_df, "salted_id")
3. Reduced runtime to 30 minutes.
Conclusion
Slow queries in Databricks are rarely unsolvable—they often boil down to data distribution, Spark configurations, or query design. By leveraging built-in tools like the Spark UI, Delta Lake optimizations, and adaptive execution, you can transform sluggish pipelines into high-performance workflows.