Mohammad Gufran Jahangir · August 7, 2025

GC stands for Garbage Collection: a process in the Java Virtual Machine (JVM), which Apache Spark runs on, that automatically frees memory by removing objects that are no longer in use.


🔍 In Simple Terms:

When Spark runs, it loads data into memory (RAM). Once a job or task finishes using that data, it’s no longer needed. The Garbage Collector steps in to clean it up to make room for new data.


🧠 Why GC Matters in Spark:

  • Spark runs on the JVM, so all of its in-memory data is managed by the garbage collector.
  • If too much data builds up in memory, the JVM pauses application threads so GC can clean up.
  • These GC pauses can slow jobs down or cause them to fail, especially with big data.

⚠️ Signs of GC Problems:

  • High “GC Time” in Spark UI (Executors tab)
  • Long task duration even for simple operations
  • OutOfMemory (OOM) errors

✅ How to Reduce GC Impact:

  • Use the right executor memory (not too much, not too little)
  • Cache only what’s needed
  • Tune spark.sql.shuffle.partitions
  • Use formats like Parquet/Delta (compact data)
  • Use bigger instance types (fewer executors → less GC)
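Most of these knobs are plain Spark configuration. A minimal sketch of setting them when building a session — the memory size, partition count, and input path below are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative values only -- right-size these for your cluster and workload.
spark = (
    SparkSession.builder
    .appName("gc-tuning-sketch")
    .config("spark.executor.memory", "8g")          # right-sized executor heap
    .config("spark.sql.shuffle.partitions", "200")  # tune shuffle parallelism
    .getOrCreate()
)

# Read a compact columnar format, and cache only what is actually reused.
df = spark.read.parquet("/data/events")  # hypothetical path
df.cache()
```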


🧠 Garbage Collection (GC) can happen on both driver and worker nodes, but the impact and concern differ:


🧩 1. Driver Node – GC Problems

  • Concern Level: 🔴 CRITICAL
  • The driver manages the Spark application, including:
    • Job coordination
    • Task scheduling
    • Metadata tracking
    • Logging

🔍 When GC is a problem on the driver:

  • You may see errors like:
    • "Driver is up but not responsive (likely due to GC)"
  • Job hangs or fails entirely
  • UI may freeze or become unresponsive

✅ Fix:

  • Increase driver memory (spark.driver.memory)
  • Avoid overloading the driver with large data collection operations like .collect(), .toPandas(), etc.
  • Optimize logging or output operations
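As a sketch, the driver-side fixes above map to one config setting plus a code habit. The memory value, table name, and output path here are assumptions for illustration:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")  # illustrative; raise if the driver GCs heavily
    .getOrCreate()
)

df = spark.read.table("sales")  # hypothetical table name

# Risky: pulls the entire dataset onto the driver heap.
# rows = df.collect()

# Safer: keep the data distributed, or bound what reaches the driver.
df.write.mode("overwrite").parquet("/tmp/sales_out")  # hypothetical output path
preview = df.limit(100).collect()                     # small, bounded sample
```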

🧩 2. Worker Nodes (Executors) – GC Problems

  • Concern Level: 🟠 Important
  • Executors do the actual computation (e.g., map, reduce, filter, joins)

🔍 When GC is a problem on executors:

  • Slow task execution
  • Frequent task retries
  • High shuffle spill or memory pressure
  • OOM (Out of Memory) errors

✅ Fix:

  • Increase executor memory
  • Optimize transformations
  • Reduce shuffle volume
  • Tune spark.sql.shuffle.partitions or enable adaptive execution
  • Use efficient formats (e.g., Delta, Parquet)
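Several of these fixes are one-line settings. For example, adaptive query execution (AQE) lets Spark choose shuffle partition sizes at runtime instead of relying on a fixed `spark.sql.shuffle.partitions`. A sketch, with illustrative values:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "16g")        # illustrative executor heap
    .config("spark.sql.adaptive.enabled", "true")  # enable adaptive execution
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
    .getOrCreate()
)
```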

🧾 Summary Table:

| Area | GC Concern | Impact | Solution |
| --- | --- | --- | --- |
| Driver | High | Application fails or hangs | Add memory, avoid large collect operations |
| Executors | Moderate | Tasks slow or fail (OOM/spill) | Tune memory, reduce shuffle, optimize jobs |

How to Estimate or Investigate Actual Memory Pressure:

  1. Check GC Time:
    • Example: executors showing ~6–7 minutes of GC time → not very high (good).
    • High GC time (e.g., >10% of total task time) would suggest memory pressure.
  2. Storage Memory vs. Total Available:
    • Example: Executor 0 using 6.5 MiB of 84.3 GiB
    • That is only ~0.007% of memory used for cached data (very low).
  3. Thread Dump (clickable in the Spark UI's Executors tab):
    • Can reveal memory pressure (e.g., "GC Overhead", "OOM", "Full GC")
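The ~10% rule of thumb above is easy to check with quick arithmetic. A sketch, where the ~4 hours of total task time is an assumed figure for illustration:

```python
def gc_fraction(gc_time_s, total_task_time_s):
    """Fraction of total task time spent in garbage collection."""
    return gc_time_s / total_task_time_s

# ~6.5 minutes of GC against an assumed ~4 hours of total task time:
frac = gc_fraction(6.5 * 60, 4 * 3600)  # ~0.027, i.e. ~2.7% -- well under 10%
memory_pressure = frac > 0.10           # False in this example
```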

What is a Thread Dump?

A thread dump is a snapshot of all threads (executing units of code) in the JVM at a specific moment. It helps identify:

  • CPU-heavy threads
  • Threads stuck in long waits (e.g., waiting for locks)
  • Garbage Collection delays
  • Deadlocks or hung jobs

🧠 Key Sections of a Thread Dump:

Each thread dump contains multiple thread blocks, each starting like this:

"Thread-XYZ" #NN prio=5 os_prio=0 tid=0xNNNNN nid=0xNNNNN runnable [address]

1. Thread Name

Example: "Executor task launch worker for task 1234"

  • Shows what this thread is doing — e.g., handling a Spark task.
  • Threads named like task launch worker are typically executing Spark stages.

2. Thread State

Examples:

  • RUNNABLE: Actively running.
  • WAITING / TIMED_WAITING: Waiting for an event (e.g., data, lock, GC).
  • BLOCKED: Waiting for a resource locked by another thread (potential deadlock).

This is a critical field to monitor.

3. Stack Trace

Lines under each thread like:

at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(...)
at java.util.concurrent.ThreadPoolExecutor.runWorker(...)
  • Shows the code path being executed.
  • Useful for locating performance bottlenecks or infinite loops.

4. Locks & Monitors

Lines like:

- waiting to lock <0x00000000dedeabcd> (a java.util.concurrent.locks.ReentrantLock)
  • Indicates the thread is waiting to acquire a lock, possibly causing delays.
  • If many threads are blocked on the same lock, this indicates contention or a bottleneck.
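Because each waiting thread names the lock address it wants, spotting contention can be automated: count how many threads are waiting on each address. A minimal sketch over a made-up dump fragment (thread names and the lock address are hypothetical):

```python
import re
from collections import Counter

# Hypothetical fragment of a thread dump (names and addresses are made up).
SAMPLE = '''
"worker-1"
	- waiting to lock <0x00000000dedeabcd> (a java.util.concurrent.locks.ReentrantLock)
"worker-2"
	- waiting to lock <0x00000000dedeabcd> (a java.util.concurrent.locks.ReentrantLock)
"worker-3"
	- locked <0x00000000dedeabcd> (a java.util.concurrent.locks.ReentrantLock)
'''

def contended_locks(dump_text, min_waiters=2):
    """Return lock addresses that two or more threads are waiting to acquire."""
    waiters = re.findall(r"waiting to lock <(0x[0-9a-f]+)>", dump_text)
    return {addr: n for addr, n in Counter(waiters).items() if n >= min_waiters}

hot_locks = contended_locks(SAMPLE)  # {'0x00000000dedeabcd': 2}
```

Here `worker-3` holds the lock while two threads wait on it, which is exactly the contention pattern described above.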

🔍 What to Check in a Spark (Databricks) Thread Dump

| Situation | What to Look For | What It Means |
| --- | --- | --- |
| ❗ High GC | Many threads in WAITING with stack traces under java.lang.ref.Finalizer or GC methods | GC overhead; consider tuning memory or spark.executor.memoryOverhead |
| ❗ Job Stuck | Multiple threads BLOCKED on the same object | Potential deadlock or high lock contention |
| ❗ Long Shuffle | Threads stuck in org.apache.spark.shuffle methods | Shuffle bottleneck; check partition size or skew |
| ❗ CPU Spike | Many threads in RUNNABLE with deep stack traces | Likely high CPU usage; consider lowering parallelism or repartitioning |
| ❗ Idle Executors | Most threads in TIMED_WAITING | Executor is idle; might be over-provisioned |

💡 When to Analyze a Thread Dump

| When | Why |
| --- | --- |
| Job fails with "driver not responding" or "out of memory" | Diagnose memory pressure, GC, or deadlocks |
| Performance slows down without errors | Identify contention or busy threads |
| Long garbage collection pauses | Spot threads stuck in GC |
| You suspect skewed partitions or shuffles | Shuffle threads can expose imbalance |

🛠️ Example Analysis

In a thread dump, if you're seeing:

  • Threads stuck on shuffle operations
  • Frequent references to GC, Finalizer, or Unsafe methods
  • Long WAITING or BLOCKED threads

Then you likely have:

  • GC overhead
  • Shuffle skew
  • Thread contention (locks)

You can search the file for keywords:

WAITING
BLOCKED
at org.apache.spark
at java.util.concurrent
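Beyond plain keyword search, a few lines of scripting can summarize a saved dump by thread state. A sketch, assuming the dump was saved as text and uses the standard `java.lang.Thread.State:` lines (the sample threads below are made up):

```python
import re
from collections import Counter

def summarize_states(dump_text):
    """Count threads by JVM thread state in a saved thread dump."""
    states = re.findall(r"java\.lang\.Thread\.State: (\w+)", dump_text)
    return Counter(states)

# Hypothetical three-thread dump fragment:
SAMPLE_DUMP = '''
"Executor task launch worker for task 1234" #42 runnable
   java.lang.Thread.State: RUNNABLE
	at org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala)

"shuffle-client-1" #43 waiting on condition
   java.lang.Thread.State: WAITING (parking)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java)

"block-manager" #44 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.spark.storage.BlockManager.get(BlockManager.scala)
'''

summary = summarize_states(SAMPLE_DUMP)  # Counter({'RUNNABLE': 1, 'WAITING': 1, 'BLOCKED': 1})
```

A dump dominated by WAITING or BLOCKED states points you back to the GC and contention cases in the table above.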

✅ What Actions Can Be Taken

  • Increase executor memory or shuffle partitions if memory pressure is high
  • Reduce number of concurrent jobs or cores per executor if high contention
  • Use ZORDER, OPTIMIZE, and repartitioning to balance shuffle loads