Introduction
The SQL002 – Query Execution Timeout error in Databricks occurs when a SQL query takes too long to execute and exceeds the configured timeout threshold. This can happen due to long-running queries, inefficient joins, large datasets, or resource constraints on the cluster.
🚨 Common symptoms of SQL002 timeout errors:
- Query fails after a specific duration.
- SQL warehouse or cluster becomes unresponsive.
- Databricks UI shows “SQL002 – Query execution timeout” in the query history.
- Long-running aggregations, joins, or queries over large datasets fail.
This guide will cover the causes, troubleshooting steps, and best practices to fix and prevent SQL execution timeouts in Databricks.
1. Identify Query Execution Timeout Threshold
Symptoms:
- SQL queries fail after a fixed period (e.g., 10 or 30 minutes).
- Query execution logs show SQL002 timeout errors.
Causes:
- Databricks SQL warehouses enforce a statement timeout (the STATEMENT_TIMEOUT parameter), which workspace admins may have lowered (e.g., to 10 minutes).
- Long-running queries exceed the configured timeout limit.
- Clusters with low resources take too long to execute queries.
Fix:
✅ Check the current timeout settings:
SET STATEMENT_TIMEOUT; -- shows the current value in a Databricks SQL session
✅ Increase the query execution timeout for Databricks SQL Warehouses:
- As a workspace admin, go to Settings → Compute → SQL warehouses.
- Under SQL Configuration Parameters, set STATEMENT_TIMEOUT to a larger value (in seconds).
✅ In a Databricks SQL session, raise the timeout for that session only:
SET STATEMENT_TIMEOUT = 1800; -- Set timeout to 30 minutes (value is in seconds)
✅ Notebook queries on all-purpose clusters have no server-side query timeout by default; if your client (JDBC/ODBC, BI tool) enforces one, raise it on the client side (see section 6 and the example below).
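A per-session timeout can also be set from Python through the databricks-sql-connector package. This is a minimal sketch, assuming the package is installed (pip install databricks-sql-connector); the hostname, HTTP path, and token are placeholders for your own workspace values:
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/warehouses/abc123",                        # placeholder
    access_token="dapi...",                                        # placeholder
) as connection:
    with connection.cursor() as cursor:
        # Raise the session-level timeout (seconds) before running the long query
        cursor.execute("SET STATEMENT_TIMEOUT = 1800")
        cursor.execute("SELECT count(*) FROM sales")
        print(cursor.fetchall())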
2. Optimize Long-Running Queries
Symptoms:
- Complex queries (JOINs, aggregations) take too long to execute.
- Query execution time increases as data grows.
Causes:
- Poorly optimized queries cause unnecessary computation.
- Large datasets need better data layout (partitioning, Z-ordering).
- Expensive joins over poorly laid-out data slow down execution.
Fix:
✅ Use EXPLAIN to analyze query performance:
EXPLAIN SELECT * FROM sales WHERE customer_id = 12345;
✅ Rewrite queries to avoid full table scans:
SELECT * FROM sales WHERE event_date >= '2024-01-01';
✅ Use ZORDER BY to optimize frequently queried columns:
OPTIMIZE sales ZORDER BY (customer_id);
✅ Use BROADCAST JOIN for smaller tables to improve performance:
SELECT /*+ BROADCAST(small_table) */ * FROM large_table JOIN small_table ON large_table.id = small_table.id;
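The same hint can be applied from PySpark with the broadcast() function. A minimal sketch, assuming tables named large_table and small_table exist and share an id column:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
large = spark.table("large_table")
small = spark.table("small_table")
# broadcast() ships the small side to every executor, so the large side is never shuffled
joined = large.join(broadcast(small), "id")
joined.explain()  # the physical plan should show BroadcastHashJoin instead of SortMergeJoin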
3. Increase Cluster Resources for Faster Execution
Symptoms:
- Queries time out even with optimized execution plans.
- Clusters run out of memory or CPU while executing queries.
Causes:
- Cluster size is too small for large queries.
- SQL warehouse does not scale automatically to handle query loads.
- Concurrency limits cause multiple queries to compete for resources.
Fix:
✅ Use a larger SQL Warehouse or increase worker nodes:
- Go to Databricks UI → SQL Warehouses.
- Select the SQL Warehouse and increase the scaling limits.
- Set auto-scaling to adjust based on load.
✅ For clusters, enable autoscaling in the cluster definition (Clusters API JSON):
{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}
✅ If running on Spark, tune shuffle partitions per session; executor memory must be set in the cluster's Spark config before startup (it cannot be changed at runtime):
spark.conf.set("spark.sql.shuffle.partitions", "500")
# In the cluster's Spark config, applied at cluster start: spark.executor.memory 8g
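On Spark 3.x (all recent Databricks runtimes), Adaptive Query Execution can size shuffle partitions at runtime instead of relying on a fixed number. A minimal sketch; AQE is already on by default in recent runtimes, so this mainly matters if it was disabled:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Re-optimize the plan at runtime using actual shuffle statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce many small shuffle partitions into fewer, right-sized ones
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")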
4. Reduce Data Volume Using Partition Pruning
Symptoms:
- Query execution times out when scanning large tables.
- Query performance is slow due to full-table scans.
Causes:
- Queries scan the entire dataset instead of filtering efficiently.
- No partition pruning is applied, leading to unnecessary computation.
Fix:
✅ Ensure queries use partition pruning:
SELECT * FROM sales WHERE event_date = '2024-01-01';
✅ Partition large tables to improve performance:
CREATE TABLE sales (customer_id STRING, total_amount DOUBLE, event_date DATE)
USING DELTA
PARTITIONED BY (event_date);
✅ Enable Dynamic Partition Pruning:
SET spark.sql.optimizer.dynamicPartitionPruning.enabled = true; -- enabled by default on Spark 3.x
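To confirm that pruning actually happens, you can write a small partitioned Delta table and inspect the plan. A minimal sketch; the table name sales_demo is hypothetical and assumes your current schema is writable:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Write a small Delta table partitioned by event_date
spark.range(1000).selectExpr(
    "id AS customer_id",
    "CAST(id % 10 AS DOUBLE) AS total_amount",
    "date_add('2024-01-01', CAST(id % 30 AS INT)) AS event_date",
).write.format("delta").partitionBy("event_date").mode("overwrite").saveAsTable("sales_demo")

# A filter on the partition column should prune: the scan's PartitionFilters
# in the plan below should list event_date, and only one partition is read
spark.table("sales_demo").filter("event_date = '2024-01-01'").explain()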
5. Handle Concurrent Queries Efficiently
Symptoms:
- Multiple users running queries cause execution timeouts.
- Databricks SQL Warehouse queues queries due to concurrency limits.
Causes:
- Too many concurrent queries overload the SQL warehouse.
- Clusters do not auto-scale fast enough to handle increased queries.
Fix:
✅ Monitor query concurrency in SQL Warehouse:
- Open the warehouse's Monitoring tab in the Databricks UI to see running and queued queries (a programmatic sketch follows below).
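For a programmatic view, the databricks-sdk package wraps the SQL Warehouses REST API. A minimal sketch, assuming the SDK is installed and can find workspace credentials; field names follow the REST API and may differ slightly across SDK versions:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads credentials from the environment or ~/.databrickscfg
for wh in w.warehouses.list():
    # state is an enum (e.g., RUNNING, STOPPED); num_clusters and
    # num_active_sessions come from the SQL Warehouses REST API response
    print(wh.name, wh.state, wh.num_clusters, wh.num_active_sessions)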
✅ Raise the effective concurrency of the Databricks SQL Warehouse:
- Go to Databricks UI → SQL Warehouses.
- Increase the warehouse's maximum cluster count so queued queries spill onto additional clusters (each cluster serves a limited number of concurrent queries, roughly 10).
✅ Use query scheduling to balance workload:
- Use Databricks Workflows to schedule heavy queries during off-peak hours.
6. Retry Queries Automatically When They Time Out
Symptoms:
- Intermittent query failures due to temporary overloads.
- Query works sometimes but fails with SQL002 at other times.
Causes:
- Transient cluster or network issues cause failures.
- Databricks SQL Warehouse throttles requests under heavy load.
Fix:
✅ Implement retry logic in Python for SQL queries:
import time
from pyspark.sql import SparkSession

def execute_query_with_retry(query, max_retries=5):
    spark = SparkSession.builder.getOrCreate()
    for i in range(max_retries):
        try:
            return spark.sql(query)
        except Exception as e:
            # Exponential backoff: wait 1, 2, 4, 8, 16 seconds between attempts
            print(f"Query failed: {e}. Retrying in {2**i} seconds...")
            time.sleep(2**i)
    raise Exception("Max retries reached. Query failed.")

query = "SELECT * FROM sales WHERE event_date = '2024-01-01'"
df = execute_query_with_retry(query)
df.show()
✅ Increase timeouts for JDBC/ODBC connections (exact property names vary by driver; values below are in seconds):
{
  "connectionTimeout": 120,
  "queryTimeout": 600
}
7. Check for Databricks Service Outages
Symptoms:
- Query timeouts occur suddenly without changes in workload.
- Multiple users experience similar timeouts at the same time.
Causes:
- Databricks backend service issues or outages can impact SQL execution.
- Cloud networking issues may slow down query execution.
Fix:
✅ Check the Databricks status page for service outages: https://status.databricks.com
✅ Run a test query to check database response time:
SELECT NOW();
Step-by-Step Troubleshooting Guide
Step 1: Check Query Execution Plan and Optimize Performance
EXPLAIN SELECT * FROM large_table WHERE event_date = '2024-01-01';
Step 2: Increase Query Timeout Settings
SET STATEMENT_TIMEOUT = 1800; -- Increase timeout to 30 minutes (Databricks SQL, in seconds)
Step 3: Improve Cluster Performance
- Increase worker nodes and enable auto-scaling.
Step 4: Partition Large Tables for Faster Query Execution
CREATE TABLE orders (order_id STRING, amount DOUBLE, order_date DATE)
USING DELTA
PARTITIONED BY (order_date);
Step 5: Use Query Caching for Repeated Queries
SET spark.databricks.io.cache.enabled = true;
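The spark.databricks.io.cache setting controls the Databricks disk cache, which speeds up repeated reads of the same Delta/Parquet data. A minimal notebook sketch (spark is predefined in Databricks notebooks; the CACHE SELECT pre-warming command is Databricks-specific):
# Enable the disk cache (effective on instance types with local SSD storage)
spark.conf.set("spark.databricks.io.cache.enabled", "true")
# Optionally pre-warm the cache for a hot date range
spark.sql("CACHE SELECT * FROM sales WHERE event_date >= '2024-01-01'")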
Best Practices to Prevent SQL Query Timeouts
✅ Optimize queries using partition pruning and Z-ordering.
✅ Use auto-scaling clusters to handle large workloads.
✅ Enable result caching to reduce repeated query execution time.
✅ Increase SQL Warehouse timeout settings for long-running queries.
✅ Use retry logic to handle transient failures.
Conclusion
The SQL002 – Query Execution Timeout error in Databricks occurs due to long-running queries, inefficient joins, large datasets, or insufficient cluster resources. By optimizing queries, increasing timeouts, improving cluster configurations, and using retry mechanisms, you can prevent query timeouts and improve execution performance in Databricks.