Introduction
In Databricks, the driver node is responsible for orchestrating Spark tasks, managing cluster state, and handling user interactions. If the driver node crashes, jobs may fail mid-execution, run slowly, or become unresponsive.
Common symptoms of driver node crashes:
- Job fails with DriverLost or Task not serializable errors.
- Databricks UI becomes unresponsive, and logs stop updating.
- OOM (Out of Memory) errors or JVM heap exhaustion.
- Cluster restarts unexpectedly or enters a terminated state.
This guide covers common causes of driver node failures, troubleshooting steps, and best practices to prevent crashes in Databricks.
Understanding the Role of the Driver Node
The driver node performs the following tasks in a Databricks cluster:
- Manages SparkContext – Tracks job execution and coordinates worker nodes.
- Handles user-defined code – Runs the main logic of notebooks or jobs.
- Collects results – Returns query results to users.
- Maintains metadata and state – Stores cached data and manages execution plans.
🚨 A driver crash disrupts all running jobs and can lose in-progress work that has not been checkpointed or written to storage.
Common Causes of Driver Node Crashes and Fixes
1. Out of Memory (OOM) Errors on the Driver
Symptoms:
- Error: “java.lang.OutOfMemoryError: Java heap space”
- Notebook execution stops abruptly.
- Driver runs out of JVM heap memory due to large data collection.
Causes:
- Too much data collected on the driver (.collect() or toPandas()).
- High-memory operations (e.g., large DataFrame caching, broadcast joins).
- Inefficient garbage collection settings for the JVM.
Fix:
✅ Avoid collecting large datasets on the driver:
df = spark.range(1000000000) # Large dataset
df.collect() # 🚨 BAD - pulls every row into driver memory
✅ Use distributed Spark transformations instead of collecting data:
df.show(10) # ✅ GOOD - Displays only a small sample
df.write.format("delta").save("/mnt/delta/output") # ✅ GOOD - Saves data without collecting
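If a pandas DataFrame is genuinely needed on the driver (for plotting or a quick inspection), cap the rows first; the sketch below is minimal, and the 1,000-row limit is an arbitrary example rather than a recommended value:
pdf = df.limit(1000).toPandas() # ✅ GOOD - pulls only a bounded sample to the driver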
✅ Increase driver memory allocation:
{
"spark.driver.memory": "8g",
"spark.driver.maxResultSize": "4g"
}
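If the driver still spends excessive time in garbage collection after memory has been sized correctly (one of the causes listed above), the driver's JVM options can also be tuned. The G1 collector shown below is one common choice, offered as an illustrative setting rather than a universal recommendation; newer Databricks runtimes may already use it by default:
{
"spark.driver.extraJavaOptions": "-XX:+UseG1GC"
}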
2. Too Many Parallel Tasks Overloading the Driver
Symptoms:
- Error: “Executor lost task but is not blacklisted.”
- Slow driver response or UI freezing.
- Jobs stall without completing.
Causes:
- High parallelism settings (spark.default.parallelism) overwhelming the driver.
- Too many simultaneous tasks queued on the driver.
- Large shuffle operations causing memory bottlenecks.
Fix:
✅ Reduce driver-side parallelism for better load balancing:
{
"spark.default.parallelism": "100",
"spark.sql.shuffle.partitions": "200"
}
✅ Use coalesce() to limit unnecessary parallelism in small jobs:
df = df.coalesce(10) # ✅ Reduces the number of partitions without a full shuffle
✅ Distribute work to worker nodes instead of keeping it on the driver.
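As a minimal sketch of that idea, per-row logic that might otherwise run in a driver-side loop can be expressed as a UDF so it executes on the executors; normalize_id below is a hypothetical stand-in for the real per-row work:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(StringType())
def normalize_id(value):
    # Hypothetical per-row logic; executes on the workers, not the driver
    return value.strip().lower() if value is not None else None

df = df.withColumn("id_norm", normalize_id(df["id"]))
For simple string cleanup like this, the built-in lower() and trim() functions would be faster than a Python UDF; the UDF only stands in for arbitrary custom logic that should not run on the driver.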
3. Unoptimized Broadcast Joins Causing Driver Memory Exhaustion
Symptoms:
- Error: “Broadcast memory limit exceeded on driver.”
- Spark job fails during join operations.
Causes:
- Auto-broadcasting large tables, leading to memory overflow.
- Inefficient Spark SQL join execution plans.
Fix:
✅ Disable automatic broadcast joins for large tables:
{
"spark.sql.autoBroadcastJoinThreshold": "-1"
}
✅ Manually control broadcast joins using broadcast() only when appropriate:
from pyspark.sql.functions import broadcast
df_small = broadcast(df_small) # ✅ Only broadcast small tables
df_large.join(df_small, "id")
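To verify which join strategy Spark actually chose, inspect the physical plan; a BroadcastHashJoin node confirms the small table was broadcast, while a SortMergeJoin indicates it was not:
df_large.join(df_small, "id").explain() # look for BroadcastHashJoin vs. SortMergeJoin in the plan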
4. Long-Running Jobs Holding Driver Resources
Symptoms:
- The driver remains occupied for hours, causing job failures.
- Jobs freeze at a specific stage for a long time.
- Executor heartbeat lost error appears in logs.
Causes:
- Jobs using long-running transformations (e.g., complex aggregations, window functions).
- No checkpointing or intermediate storage, causing excessive memory usage.
- Infinite loops in Spark jobs keeping the driver busy.
Fix:
✅ Use checkpointing to free up memory periodically:
spark.sparkContext.setCheckpointDir("/mnt/checkpoints") # a checkpoint location must be set first (example path)
df = df.checkpoint()
✅ Break down long-running jobs into multiple stages.
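One way to do this is to persist an intermediate result to storage and read it back, which truncates the query lineage the driver has to track. A minimal sketch, assuming a hypothetical Delta path:
# Persist an intermediate result and reload it to shorten lineage (the path is an example)
df.write.format("delta").mode("overwrite").save("/mnt/delta/intermediate")
df = spark.read.format("delta").load("/mnt/delta/intermediate")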
✅ Monitor Spark UI for stuck jobs and optimize accordingly.
5. Driver Node Failing Due to Network or Storage Issues
Symptoms:
- Error: “Driver terminated unexpectedly due to network failure.”
- Cluster disconnects from Databricks control plane.
- DBFS mounts fail, causing job crashes.
Causes:
- Network misconfigurations in VPC/VNet setups.
- Slow DBFS storage or unresponsive external data sources.
- Lost connection to the metastore, breaking queries.
Fix:
✅ Ensure DBFS and cloud storage are accessible before running jobs:
dbutils.fs.ls("dbfs:/mnt/mybucket/")
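A lightweight pre-flight check can fail fast with a clear message instead of letting the job die mid-run; the mount path below is the same example path used above:
# Fail fast if the mount is not reachable (example path)
try:
    dbutils.fs.ls("dbfs:/mnt/mybucket/")
except Exception as e:
    raise RuntimeError(f"Storage mount is not accessible: {e}")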
✅ Check for network interruptions using connectivity tests:
ping <databricks-endpoint>
nc -zv storage-account.blob.core.windows.net 443
✅ Use AWS PrivateLink or Azure Private Endpoints for stable storage connections.
Step-by-Step Troubleshooting Guide
1. Check Driver Logs for Errors
- Go to Databricks UI → Clusters → Driver Logs → View Logs.
- Look for OutOfMemoryError, task failures, or timeout issues.
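If cluster log delivery is enabled, driver logs are also copied to the configured destination and can be listed from a notebook; the DBFS destination and cluster ID below are placeholders, not defaults:
# List delivered driver log files (destination and cluster ID are placeholders)
dbutils.fs.ls("dbfs:/cluster-logs/<cluster-id>/driver/")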
2. Monitor Resource Utilization
- Go to Spark UI → Executors to check driver memory and CPU usage.
- Look for high GC time or excessive task scheduling on the driver.
3. Test Network and Storage Stability
- Verify external data sources (S3, ADLS, JDBC connections) are reachable.
- Run network tests using:
curl -I https://<storage-endpoint>
4. Optimize Job Configuration and Parallelism
- Reduce spark.default.parallelism and spark.sql.shuffle.partitions (see the runtime example after this list).
- Disable broadcast joins for large datasets.
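The shuffle partition count and the broadcast threshold can be adjusted per notebook at runtime (spark.default.parallelism generally needs to be set in the cluster configuration instead); the values below are illustrative starting points, not universal recommendations:
# Tune shuffle partitions and disable auto-broadcast at runtime (illustrative values)
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")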
Best Practices to Prevent Driver Node Crashes
✅ Use Worker Nodes for Heavy Computation
- Avoid running large jobs directly on the driver.
- Push intensive tasks to executors using .foreachPartition() (see the sketch after this list).
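A minimal sketch of that pattern, assuming a hypothetical send_batch() helper that pushes records to an external system:
# Hypothetical helper that pushes a batch of rows to an external sink
def send_batch(rows):
    batch = [row.asDict() for row in rows]
    # e.g., POST 'batch' to an API or write it to a message queue here

# Each partition is processed on the executor that holds it, not on the driver
df.foreachPartition(send_batch)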
✅ Monitor Driver Memory Usage
- Use Databricks Ganglia UI to track driver memory and CPU utilization.
✅ Limit collect() and Avoid Large Data Fetches
- Never use .collect() on large DataFrames; use .show() or .limit().
✅ Optimize Shuffle and Parallelism Settings
- Reduce shuffle partitions to avoid excessive memory usage:
{
"spark.sql.shuffle.partitions": "200"
}
✅ Enable Auto-Restart for Critical Workloads
- Set up Databricks cluster auto-restart policies to recover from driver crashes.
Real-World Example: Fixing a Driver Crash in Databricks
Scenario:
A machine learning job processing 1TB of data crashed repeatedly with OutOfMemoryError.
Root Cause:
- The job collected all data to the driver using .collect().
- The driver ran out of heap memory and crashed.
Solution:
- Replaced .collect() with .show(10).
- Increased driver memory to 16GB (spark.driver.memory = "16g").
- Distributed computation across worker nodes instead of using the driver.
✅ Result: Job executed successfully without crashing.
Conclusion
Driver node crashes in Databricks typically occur due to memory overload, excessive parallelism, inefficient joins, or network issues. By optimizing job execution, using efficient memory settings, and distributing workloads properly, teams can prevent failures and ensure stable cluster performance.