Introduction
A PYSPARK001 error indicates that a Python notebook in Databricks crashed due to an Out of Memory (OOM) condition. This occurs when the driver or executor nodes run out of available memory, leading to job failure and notebook termination.
🚨 Common symptoms of PYSPARK001 – OOM errors:
- Notebook crashes abruptly with an OOM error message.
- Job stages are slow or fail with “java.lang.OutOfMemoryError: Java heap space”.
- Driver node becomes unresponsive or restarts.
- High GC (Garbage Collection) times and memory spikes in Spark UI.
This guide covers common causes, troubleshooting steps, and best practices to prevent and resolve PYSPARK001 errors in Databricks.
Common Causes and Fixes for OOM in Databricks Notebooks
1. Collecting Large DataFrames on the Driver
Symptoms:
- Notebook crashes after a .collect() or .toPandas() call.
- Driver memory spikes and then crashes.
- Error: “Out of memory on driver node.”
Causes:
- .collect() and .toPandas() load the entire DataFrame into driver memory, causing an OOM.
- Large datasets cannot fit into the available memory on the driver.
Fix:
✅ Avoid using .collect() or .toPandas() on large DataFrames:
df = spark.range(1000000000) # Large dataset
df.collect() # 🚨 BAD – Loads the entire DataFrame into driver memory
✅ Use .show() or .limit() to preview the data:
df.show(10) # ✅ GOOD – Displays only a sample
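✅ If a pandas DataFrame is genuinely needed (for plotting or scikit-learn, say), bound the row count before converting; the snippet below is a sketch that assumes a 10,000-row sample is enough:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") # Arrow speeds up the conversion on Spark 3.x
sample_pdf = df.limit(10000).toPandas() # Only the bounded sample reaches driver memory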
✅ If rows must be processed on the driver, cap how many you bring back (or stream them with toLocalIterator()):
for row in df.limit(1000).collect(): # Only 1,000 rows reach driver memory
    process_batch(row) # process_batch is a user-defined per-row handler
✅ Write data to a file instead of collecting it in memory:
df.write.format("parquet").save("/mnt/output/large_data/")
2. Inefficient Joins and Shuffles
Symptoms:
- Job hangs or fails during join or shuffle operations.
- Executor memory usage spikes before the crash.
- Error: “java.lang.OutOfMemoryError: Java heap space.”
Causes:
- Broadcast joins on large tables exceed the driver’s memory.
- Shuffles produce large intermediate data, causing memory pressure.
Fix:
✅ Use broadcast joins only for small tables:
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "id")
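✅ Spark also broadcasts tables automatically when they fall below spark.sql.autoBroadcastJoinThreshold; if an automatic broadcast of a larger-than-expected table is pressuring the driver, disable it and let Spark fall back to a sort-merge join:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") # -1 disables automatic broadcast joins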
✅ Increase shuffle partitions to reduce memory usage per task:
spark.conf.set("spark.sql.shuffle.partitions", "200")
✅ Use repartition() to balance data distribution:
df = df.repartition(100) # Spreads data evenly across 100 partitions so no single executor is overloaded
3. High Driver Memory Usage Due to Cached Data
Symptoms:
- Driver memory usage increases over time, eventually leading to a crash.
- Notebook becomes slow and unresponsive.
- Error: “Out of memory due to cached DataFrame.”
Causes:
- Large DataFrames cached in memory without eviction.
- Multiple cached DataFrames competing for limited memory.
Fix:
✅ Cache only what you need and monitor it in the Spark UI (Storage tab):
df.cache()
df.count() # An action is required to materialize the cache
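To confirm that the cache took effect, the DataFrame's storage level can also be checked in code (the Storage tab in the Spark UI shows the size held in memory):
print(df.storageLevel) # Reports the storage level, e.g. Disk Memory Deserialized 1x Replicated when cached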
✅ Unpersist unused cached DataFrames:
df.unpersist()
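If many DataFrames were cached during exploration, everything can be dropped at once rather than unpersisting them one by one:
spark.catalog.clearCache() # Removes all cached tables and DataFrames from memory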
✅ Use persist() with an appropriate storage level:
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY) # Spills the data to disk instead of holding it in memory
4. Insufficient Cluster Resources
Symptoms:
- OOM errors on both driver and executors.
- Job fails with executor-lost or “killed for exceeding memory limits” messages in the logs.
- Cluster scaling fails to prevent the error.
Causes:
- Insufficient memory allocated to driver or executors.
- Job requires more resources than available in the cluster.
Fix:
✅ Increase driver and executor memory:
- Go to Cluster Settings → Advanced Options → Spark Config
- Add the following (the Spark Config box takes one “key value” pair per line):
spark.driver.memory 8g
spark.executor.memory 16g
✅ Enable autoscaling for dynamic resource allocation:
- In the cluster configuration, enable Autoscaling and set minimum and maximum worker counts; on autoscaling clusters Databricks manages spark.dynamicAllocation.enabled for you.
✅ Use larger instance types with more memory and CPUs.
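A minimal sketch of the relevant fields in a cluster definition (Clusters API JSON); the node type name here is only an illustrative Azure example and should be replaced with an instance type available in your workspace:
{
"node_type_id": "Standard_E16ds_v4",
"driver_node_type_id": "Standard_E16ds_v4",
"autoscale": { "min_workers": 2, "max_workers": 8 }
}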
5. Data Skew and Imbalanced Partitions
Symptoms:
- Some tasks take significantly longer, causing driver crashes.
- One or two partitions consume all the memory, while others remain idle.
Causes:
- Data skew causes a few partitions to contain most of the data.
- Imbalanced partitions overload individual executors.
Fix:
✅ Identify skewed keys and repartition data:
df.groupBy("skewed_column").count().orderBy("count", ascending=False).show() # Largest keys first
✅ Use salting to distribute data more evenly:
from pyspark.sql.functions import col, concat, lit, floor, rand
# Append a random salt (0–9) to the skewed key so its rows spread across more partitions
df = df.withColumn("salted_key", concat(col("key"), lit("_"), floor(rand() * 10).cast("string")))
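For the join itself, the other (smaller) side has to carry every possible salt value so the salted keys still match; a minimal sketch, reusing the small_df and key names from the broadcast example above and assuming the same 10-way salt:
from pyspark.sql.functions import array, col, concat, explode, lit
salt_values = array(*[lit(str(i)) for i in range(10)]) # Same 0–9 salt range used above
small_salted = (small_df
    .withColumn("salt", explode(salt_values)) # Replicate each row once per salt value
    .withColumn("salted_key", concat(col("key"), lit("_"), col("salt"))))
result = df.join(small_salted, "salted_key") # Join on the salted key instead of the skewed key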
✅ Use coalesce() or repartition() to adjust partition sizes:
df = df.repartition(100)
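✅ On Spark 3.x (Databricks Runtime 7.x and later), Adaptive Query Execution can detect and split skewed shuffle partitions at runtime, which is worth enabling before hand-tuning salts:
spark.conf.set("spark.sql.adaptive.enabled", "true") # Enables AQE
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true") # Lets AQE split skewed join partitions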
6. Improper Python Object Serialization
Symptoms:
- Job fails with “Task not serializable” or “Pickle serialization error”.
- OOM errors occur when passing large objects to Spark tasks.
Causes:
- Large Python objects passed between Spark tasks.
- Improper serialization of custom Python objects.
Fix:
✅ Avoid shipping large driver-side objects into Spark tasks:
# 🚨 BAD – the entire Python object is built on the driver and serialized out to the cluster
rdd = sc.parallelize(large_dataset)
# ✅ GOOD – let Spark read the data directly into a distributed DataFrame
df = spark.read.format("parquet").load("/mnt/output/large_data/")
✅ Use broadcast variables for large read-only objects:
bc_var = sc.broadcast(large_dict)
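Tasks then read the broadcast copy through .value; a minimal sketch, assuming large_dict maps ids to labels:
labels = sc.parallelize([1, 2, 3]).map(lambda x: bc_var.value.get(x, "unknown")).collect() # Each executor holds one copy of large_dict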
Step-by-Step Troubleshooting Guide
1. Check Spark UI for Memory Usage
- Go to Spark UI → Executors and monitor memory and GC times.
- Look for executors with high memory usage or frequent garbage collection.
2. Review Logs for OOM Errors
- Go to Cluster Logs → Driver Logs and check for:
java.lang.OutOfMemoryError: Java heap space
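While reviewing logs, it also helps to confirm the memory settings the cluster actually started with, straight from the notebook:
print(spark.sparkContext.getConf().get("spark.driver.memory", "default")) # Driver heap actually configured
print(spark.sparkContext.getConf().get("spark.executor.memory", "default")) # Executor heap actually configured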
3. Optimize Data Processing Steps
- Avoid collecting large datasets on the driver.
- Repartition data for balanced processing.
4. Increase Cluster Resources
- Add more memory and CPUs to the driver and executors.
- Enable auto-scaling for resource flexibility.
Best Practices to Prevent OOM in Databricks
✅ Use Distributed Operations Instead of Collecting Data
- Avoid .collect() and .toPandas() on large datasets.
✅ Optimize Join and Shuffle Operations
- Use broadcast joins for small tables.
- Repartition data to reduce shuffle size.
✅ Monitor and Tune Memory Usage
- Cache only essential DataFrames.
- Increase driver and executor memory as needed.
✅ Use Auto-Scaling Clusters
- Allow Databricks to dynamically adjust cluster size based on workload.
Conclusion
PYSPARK001 errors – Python notebook crashes caused by OOM – can severely impact job performance and stability. By optimizing memory usage, tuning Spark configurations, and distributing workloads properly, teams can prevent crashes and ensure smooth execution in Databricks.