
SPARK002 – Out of Memory (OOM) Error in Databricks

Introduction

The SPARK002 – Out of Memory (OOM) error in Databricks occurs when a Spark executor or driver runs out of memory while processing data. This can cause job failures, slow execution, cluster crashes, or unresponsive notebooks.

🚨 Common symptoms of SPARK002 OOM errors:

  • Job fails with java.lang.OutOfMemoryError: Java heap space.
  • Job runs slowly and then crashes.
  • Executor logs show ExecutorLostFailure messages.
  • Driver logs show excessive garbage collection (GC) times.

This guide covers common causes of OOM errors, troubleshooting steps, and best practices to optimize memory usage in Databricks Spark clusters.


1. Understanding Memory Usage in Databricks

In Spark, memory is divided into:

  • Driver Memory: Manages job coordination and small data collections.
  • Executor Memory: Handles distributed computations and caching.
  • Shuffle Memory: Used for sorting, aggregations, and joins.

💡 SPARK002 errors occur when:

  • Executors run out of heap memory while processing large datasets.
  • The driver collects too much data, exceeding its allocated memory.
  • Shuffle operations (JOIN, GROUP BY, SORT) require more memory than available.
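
As a quick check, the memory the cluster was actually started with can be read from the SparkConf in a notebook (a minimal sketch; these are standard Spark keys, and some may be unset depending on how the cluster is configured):

conf = spark.sparkContext.getConf()
for key in ("spark.driver.memory",
            "spark.executor.memory",
            "spark.executor.memoryOverhead"):
    print(key, "=", conf.get(key, "not set"))  # prints each setting, or "not set" if it is not configured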

2. Common Causes of SPARK002 OOM Errors and Fixes

1. Too Much Data Collected on the Driver (collect() Overload)

Symptoms:

  • Error: “java.lang.OutOfMemoryError: Java heap space”
  • The notebook crashes when calling .collect() on a large dataset.

Causes:

  • .collect() brings all data to the driver, causing a memory overflow.
  • Large DataFrames are not processed in a distributed manner.

Fix:
Avoid using .collect() for large datasets:

df.show(10)  # ✅ Instead of df.collect()

Use .take(n) or .limit(n) instead of .collect():

df.limit(100).show()  # ✅ Only fetches 100 rows instead of entire dataset

Write results to storage instead of collecting them:

df.write.format("delta").save("/mnt/delta/output")
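
If rows genuinely must reach the driver, stream them instead of materializing everything at once (a sketch; toLocalIterator() fetches one partition at a time, so peak driver memory stays bounded by the largest single partition):

row_count = 0
for row in df.toLocalIterator():  # streams partitions to the driver one at a time
    row_count += 1                # replace with your row-level handling
print(row_count)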

2. Memory-Intensive Operations (JOIN, GROUP BY, SORT)

Symptoms:

  • Error: “ExecutorLostFailure: Container killed due to exceeding memory limits.”
  • Jobs fail during JOIN or GROUP BY operations.

Causes:

  • Large joins require shuffle memory, which may exceed executor limits.
  • Sorting operations spill to disk, leading to performance issues.

Fix:
Use Broadcast Joins for small tables:

from pyspark.sql.functions import broadcast

df_large = spark.read.parquet("s3://large-dataset/")
df_small = spark.read.parquet("s3://small-dataset/")

df_join = df_large.join(broadcast(df_small), "id")  # ✅ Broadcast small table

Tune the number of shuffle partitions so each shuffle task processes a manageable slice of data (this setting controls partition count, not memory; 200 is the Spark default):

{
  "spark.sql.shuffle.partitions": "200"
}
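
For very large shuffles, the value can also be raised for the current session from a notebook (a sketch; 400 is illustrative and should be sized to your data volume):

spark.conf.set("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle tasks per stage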

Use repartitioning to optimize shuffle operations:

df = df.repartition(50, "id")  # ✅ Optimized shuffle for large datasets

3. Too Many Small Partitions (High Overhead in Memory Allocation)

Symptoms:

  • Excessive GC (Garbage Collection) slows down jobs.
  • Cluster runs out of memory even for moderate datasets.

Causes:

  • Every partition becomes a task with its own scheduling and memory overhead, so thousands of tiny partitions add up quickly.
  • The default partition count often does not match the data volume, leaving partitions that are too small or unevenly sized.

Fix:
Repartition datasets before processing:

df = df.repartition(100)  # ✅ Reduces overhead by consolidating partitions

Use coalesce() to reduce the partition count without a full shuffle, for example before writing small outputs:

df = df.coalesce(10)  # ✅ Reduces number of partitions before writing data
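
Before choosing between repartition() and coalesce(), it helps to check how many partitions the DataFrame currently has (a quick sketch):

print(df.rdd.getNumPartitions())  # current partition count before tuning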

4. Insufficient Cluster Memory Allocation

Symptoms:

  • Jobs crash even for small datasets.
  • Cluster memory usage reaches 100%.

Causes:

  • Executors are not allocated enough memory to process workloads.
  • The cluster does not scale to handle large datasets.

Fix:
Increase executor memory allocation:

{
  "spark.executor.memory": "8g",
  "spark.executor.memoryOverhead": "2g"
}

Enable autoscaling to add resources dynamically:

{
  "autoscale.min_workers": 2,
  "autoscale.max_workers": 10
}

Use a larger cluster for high-memory workloads.
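
These settings can be combined in a single cluster definition. A sketch of the relevant fields is shown below; the field names follow the Databricks Clusters API, and the cluster name, runtime version, and node type are illustrative placeholders:

# Sketch of a cluster spec combining memory, overhead, and autoscaling settings.
# Field names follow the Databricks Clusters API; values are illustrative only.
cluster_spec = {
    "cluster_name": "high-memory-etl",          # illustrative name
    "spark_version": "<databricks-runtime>",    # choose your runtime version
    "node_type_id": "<memory-optimized-node>",  # e.g. a memory-optimized instance type
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "spark_conf": {
        "spark.executor.memory": "8g",
        "spark.executor.memoryOverhead": "2g",
    },
}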


5. High Garbage Collection (GC) Overhead

Symptoms:

  • Executor logs show long GC pauses (over 60% of execution time).
  • Frequent task retries due to memory pressure.

Causes:

  • Excessive small object creation leading to memory fragmentation.
  • Large RDD/DataFrame caching fills up executor memory.

Fix:
Reduce caching of unnecessary objects:

df.unpersist()  # ✅ Clears cached DataFrame from memory
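
To check whether a DataFrame is still occupying cache before releasing it, its current storage level can be inspected (a small sketch):

if df.storageLevel.useMemory or df.storageLevel.useDisk:  # still cached?
    df.unpersist()                                        # free executor memory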

Enable G1GC (the Garbage-First collector) for better GC behavior on large heaps:

{
  "spark.executor.extraJavaOptions": "-XX:+UseG1GC"
}

Increase executor memory to reduce GC frequency.


6. Too Many Cached DataFrames Filling Memory

Symptoms:

  • Notebook runs out of memory after multiple queries.
  • Job slows down due to excessive caching.

Causes:

  • DataFrames are cached unnecessarily, consuming memory.
  • Cached data is not cleared between operations.

Fix:
Clear cache before running large queries:

spark.catalog.clearCache()

Persist only the DataFrames you actually reuse, and release them when finished (see the sketch below):

df.persist()  # ✅ Cache only reused DataFrames; pair every persist() with an unpersist()
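
A fuller pattern is to persist an intermediate result with an explicit storage level only for as long as it is reused, then release it (a sketch; the "status" column and filter value are hypothetical):

from pyspark.sql import functions as F
from pyspark.storagelevel import StorageLevel

# Persist a reused intermediate with spill-to-disk, then release it when done.
df_active = df.filter(F.col("status") == "active").persist(StorageLevel.MEMORY_AND_DISK)
df_active.count()      # materialize the cache once
# ... run the queries that reuse df_active ...
df_active.unpersist()  # release executor memory afterwards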

3. Step-by-Step Troubleshooting Guide

1. Check Cluster Memory Usage

Go to Databricks UI → Clusters → Metrics and verify executor and driver memory usage.

2. Identify Large Memory-Consuming Operations

Run:

df.explain(True)  # ✅ Shows query execution plan

Look for expensive operations like shuffle, join, sort, or collect.

3. Monitor Garbage Collection Logs

Run:

cat /dbfs/cluster-logs/<cluster-id>/driver/log4j-active.log

Check for long GC pauses (> 50% of execution time).
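
From a notebook, the same log can be read with dbutils (a sketch; it assumes cluster log delivery to DBFS is enabled, and <cluster-id> remains a placeholder for your cluster):

log_path = "dbfs:/cluster-logs/<cluster-id>/driver/log4j-active.log"
print(dbutils.fs.head(log_path, 65536))  # first 64 KB of the driver log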

4. Tune Spark Memory Configuration

Modify executor and driver memory settings:

{
  "spark.driver.memory": "8g",
  "spark.executor.memory": "12g",
  "spark.executor.memoryOverhead": "4g"
}

5. Tune the Partition Count for Large Data

df = df.repartition(200)  # ✅ Balance partition sizes; pick a count suited to the data volume

6. Enable Autoscaling for Large Jobs

{
  "autoscale.min_workers": 2,
  "autoscale.max_workers": 10
}

4. Best Practices to Avoid SPARK002 OOM Errors

Use show() Instead of collect()

df.show(10)  # ✅ Prevents memory overload on the driver

Optimize Shuffle and Joins Using Broadcast

from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "id")

Use Autoscaling for Memory-Intensive Jobs

{
  "autoscale.max_workers": 10
}

Monitor Memory Usage and Optimize Cluster Size

top -o %MEM  # Driver-node processes only; use the cluster Metrics tab for executor memory

Persist Only Necessary Data and Clear Cache

spark.catalog.clearCache()

5. Conclusion

The SPARK002 Out of Memory (OOM) error in Databricks occurs due to excessive memory usage in Spark executors or the driver. By optimizing memory allocation, reducing shuffle operations, enabling autoscaling, and using efficient data partitioning, you can prevent job failures and improve cluster performance.
