Introduction
A PYSPARK001 error indicates that a Python notebook in Databricks crashed due to an Out of Memory (OOM) condition. This occurs when the driver or executor nodes run out of available memory, leading to job failure and notebook termination.
🚨 Common symptoms of PYSPARK001 – OOM errors:
- Notebook crashes abruptly with an OOM error message.
- Job stages are slow or fail with “java.lang.OutOfMemoryError: Java heap space”.
- Driver node becomes unresponsive or restarts.
- High GC (Garbage Collection) times and memory spikes in Spark UI.
This guide covers common causes, troubleshooting steps, and best practices to prevent and resolve PYSPARK001 errors in Databricks.
Common Causes and Fixes for OOM in Databricks Notebooks
1. Collecting Large DataFrames on the Driver
Symptoms:
- Notebook crashes after a .collect() or .toPandas() call.
- Driver memory spikes and then crashes.
- Error: “Out of memory on driver node.”
Causes:
- .collect() and .toPandas() load the entire DataFrame into driver memory, causing an OOM.
- Large datasets cannot fit into the available memory on the driver.
Fix:
✅ Avoid using .collect() or .toPandas() on large DataFrames:
df = spark.range(1000000000) # Large dataset
df.collect() # 🚨 BAD – Loads the entire DataFrame into driver memory
✅ Use .show() or .limit() to preview the data:
df.show(10) # ✅ GOOD – Displays only a sample
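✅ If a pandas DataFrame is genuinely needed (for plotting or scikit-learn, say), bound the row count before converting; the snippet below is a sketch that assumes a 10,000-row sample is enough:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true") # Arrow speeds up the conversion on Spark 3.x
sample_pdf = df.limit(10000).toPandas() # Only the bounded sample reaches driver memory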
✅ If rows must be processed on the driver, cap how many you bring back (or stream them with toLocalIterator()):
for row in df.limit(1000).collect(): # Only 1,000 rows reach driver memory
    process_batch(row) # process_batch is a user-defined per-row handler
✅ Write data to a file instead of collecting it in memory:
df.write.format("parquet").save("/mnt/output/large_data/")
2. Inefficient Joins and Shuffles
Symptoms:
- Job hangs or fails during join or shuffle operations.
- Executor memory usage spikes before the crash.
- Error: “java.lang.OutOfMemoryError: Java heap space.”
Causes:
- Broadcast joins on large tables exceed the driver’s memory.
- Shuffles produce large intermediate data, causing memory pressure.
Fix:
✅ Use broadcast joins only for small tables:
from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "id")
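✅ Spark also broadcasts tables automatically when they fall below spark.sql.autoBroadcastJoinThreshold; if an automatic broadcast of a larger-than-expected table is pressuring the driver, disable it and let Spark fall back to a sort-merge join:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") # -1 disables automatic broadcast joins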
✅ Increase shuffle partitions to reduce memory usage per task:
spark.conf.set("spark.sql.shuffle.partitions", "200")
✅ Use repartition() to balance data distribution:
df = df.repartition(100) # Spreads data evenly across 100 partitions so no single executor is overloaded
3. High Driver Memory Usage Due to Cached Data
Symptoms:
- Driver memory usage increases over time, eventually leading to a crash.
- Notebook becomes slow and unresponsive.
- Error: “Out of memory due to cached DataFrame.”
Causes:
- Large DataFrames cached in memory without eviction.
- Multiple cached DataFrames competing for limited memory.
Fix:
✅ Cache only what you need and monitor it in the Spark UI (Storage tab):
df.cache()
df.count() # An action is required to materialize the cache
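To confirm that the cache took effect, the DataFrame's storage level can also be checked in code (the Storage tab in the Spark UI shows the size held in memory):
print(df.storageLevel) # Reports the storage level, e.g. Disk Memory Deserialized 1x Replicated when cached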
✅ Unpersist unused cached DataFrames:
df.unpersist()
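If many DataFrames were cached during exploration, everything can be dropped at once rather than unpersisting them one by one:
spark.catalog.clearCache() # Removes all cached tables and DataFrames from memory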
✅ Use persist() with an appropriate storage level:
from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY) # Spills the data to disk instead of holding it in memory
4. Insufficient Cluster Resources
Symptoms:
- OOM errors on both driver and executors.
- Job fails with executor-lost or “killed for exceeding memory limits” messages in the logs.
- Cluster scaling fails to prevent the error.
Causes:
- Insufficient memory allocated to driver or executors.
- Job requires more resources than available in the cluster.
Fix:
✅ Increase driver and executor memory:
- Go to Cluster Settings → Advanced Options → Spark Config
- Add the following (the Spark Config box takes one “key value” pair per line):
spark.driver.memory 8g
spark.executor.memory 16g
✅ Enable autoscaling for dynamic resource allocation:
- In the cluster configuration, enable Autoscaling and set minimum and maximum worker counts; on autoscaling clusters Databricks manages spark.dynamicAllocation.enabled for you.
✅ Use larger instance types with more memory and CPUs.
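A minimal sketch of the relevant fields in a cluster definition (Clusters API JSON); the node type name here is only an illustrative Azure example and should be replaced with an instance type available in your workspace:
{
"node_type_id": "Standard_E16ds_v4",
"driver_node_type_id": "Standard_E16ds_v4",
"autoscale": { "min_workers": 2, "max_workers": 8 }
}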
5. Data Skew and Imbalanced Partitions
Symptoms:
- Some tasks take significantly longer, causing driver crashes.
- One or two partitions consume all the memory, while others remain idle.
Causes:
- Data skew causes a few partitions to contain most of the data.
- Imbalanced partitions overload individual executors.
Fix:
✅ Identify skewed keys and repartition data:
df.groupBy("skewed_column").count().orderBy("count", ascending=False).show() # Largest keys first
✅ Use salting to distribute data more evenly:
from pyspark.sql.functions import col, concat, lit, floor, rand
# Append a random salt (0–9) to the skewed key so its rows spread across more partitions
df = df.withColumn("salted_key", concat(col("key"), lit("_"), floor(rand() * 10).cast("string")))
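For the join itself, the other (smaller) side has to carry every possible salt value so the salted keys still match; a minimal sketch, reusing the small_df and key names from the broadcast example above and assuming the same 10-way salt:
from pyspark.sql.functions import array, col, concat, explode, lit
salt_values = array(*[lit(str(i)) for i in range(10)]) # Same 0–9 salt range used above
small_salted = (small_df
    .withColumn("salt", explode(salt_values)) # Replicate each row once per salt value
    .withColumn("salted_key", concat(col("key"), lit("_"), col("salt"))))
result = df.join(small_salted, "salted_key") # Join on the salted key instead of the skewed key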
✅ Use coalesce() or repartition() to adjust partition sizes:
df = df.repartition(100)
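✅ On Spark 3.x (Databricks Runtime 7.x and later), Adaptive Query Execution can detect and split skewed shuffle partitions at runtime, which is worth enabling before hand-tuning salts:
spark.conf.set("spark.sql.adaptive.enabled", "true") # Enables AQE
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true") # Lets AQE split skewed join partitions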
6. Improper Python Object Serialization
Symptoms:
- Job fails with “Task not serializable” or “Pickle serialization error”.
- OOM errors occur when passing large objects to Spark tasks.
Causes:
- Large Python objects passed between Spark tasks.
- Improper serialization of custom Python objects.
Fix:
✅ Avoid shipping large driver-side objects into Spark tasks:
# 🚨 BAD – the entire Python object is built on the driver and serialized out to the cluster
rdd = sc.parallelize(large_dataset)
# ✅ GOOD – let Spark read the data directly into a distributed DataFrame
df = spark.read.format("parquet").load("/mnt/output/large_data/")
✅ Use broadcast variables for large read-only objects:
bc_var = sc.broadcast(large_dict)
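Tasks then read the broadcast copy through .value; a minimal sketch, assuming large_dict maps ids to labels:
labels = sc.parallelize([1, 2, 3]).map(lambda x: bc_var.value.get(x, "unknown")).collect() # Each executor holds one copy of large_dict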
Step-by-Step Troubleshooting Guide
1. Check Spark UI for Memory Usage
- Go to Spark UI → Executors and monitor memory and GC times.
- Look for executors with high memory usage or frequent garbage collection.
2. Review Logs for OOM Errors
- Go to Cluster Logs → Driver Logs and check for:
java.lang.OutOfMemoryError: Java heap space
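While reviewing logs, it also helps to confirm the memory settings the cluster actually started with, straight from the notebook:
print(spark.sparkContext.getConf().get("spark.driver.memory", "default")) # Driver heap actually configured
print(spark.sparkContext.getConf().get("spark.executor.memory", "default")) # Executor heap actually configured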
3. Optimize Data Processing Steps
- Avoid collecting large datasets on the driver.
- Repartition data for balanced processing.
4. Increase Cluster Resources
- Add more memory and CPUs to the driver and executors.
- Enable auto-scaling for resource flexibility.
Best Practices to Prevent OOM in Databricks
✅ Use Distributed Operations Instead of Collecting Data
- Avoid .collect() and .toPandas() on large datasets.
✅ Optimize Join and Shuffle Operations
- Use broadcast joins for small tables.
- Repartition data to reduce shuffle size.
✅ Monitor and Tune Memory Usage
- Cache only essential DataFrames.
- Increase driver and executor memory as needed.
✅ Use Auto-Scaling Clusters
- Allow Databricks to dynamically adjust cluster size based on workload.
Conclusion
PYSPARK001 errors – Python notebook crashes caused by OOM – can severely impact job performance and stability. By optimizing memory usage, tuning Spark configurations, and distributing workloads properly, teams can prevent crashes and ensure smooth execution in Databricks.