This guide to Executors in Databricks covers:
- What Executors are
- Their lifecycle (Active / Dead / Total)
- How they work in Databricks (with Spark)
- Common performance issues (memory, GC, shuffle)
- How to debug executor-related problems
- Best practices for optimization
Understanding Executors in Databricks: A Complete Guide with Debugging Tips
When running workloads in Databricks, one of the most important concepts to understand for performance tuning and troubleshooting is the Executor. Executors are the workhorses of Apache Spark — and knowing how they behave can help you prevent job failures, improve speed, and reduce costs.
1. What is an Executor?
In Spark (and Databricks), an executor is a distributed agent that runs on worker nodes and is responsible for:
- Executing tasks assigned by the driver
- Storing data in memory or disk (RDDs, DataFrames)
- Communicating with the driver and other executors
Key points:
- Each executor runs in its own JVM process.
- Executors are allocated per application and live until the application ends (unless dynamic allocation removes them).
- Executors process tasks in parallel using multiple cores.
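The relationship between executors, cores, and parallelism can be sketched with a small helper (the function name is illustrative, not a Spark API):

```python
def total_task_slots(num_executors: int, cores_per_executor: int) -> int:
    """Upper bound on the number of tasks Spark can run concurrently."""
    return num_executors * cores_per_executor

# 9 executors with 4 cores each can run up to 36 tasks at once
print(total_task_slots(9, 4))  # → 36
```

If a stage has more tasks than slots, the extras simply wait in the scheduler queue until a slot frees up.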
2. Executors in Databricks Spark UI
When you open the Executors tab in the Databricks Spark UI, you’ll see:
| Column | Meaning |
|---|---|
| Active Executors | Currently running executors doing work |
| Dead Executors | Executors that have finished work or were removed/freed |
| Cores | Number of CPU cores per executor |
| Active Tasks | Tasks currently executing on that executor |
| Failed Tasks | Tasks that failed on this executor |
| Storage Memory | Memory available for caching / storage |
| Shuffle Read/Write | Data read/written during shuffle phases |
| GC Time | Garbage collection time — high GC means memory pressure |
Active Executors
These are alive and currently processing tasks.
Example:
Active(9)
means 9 executors are alive right now.
Dead Executors
These are executors that:
- Completed all their tasks and shut down (normal)
- Were killed due to dynamic allocation removing idle executors
- Crashed due to memory errors, driver unresponsiveness, or GC issues
Example:
Dead(99)
means 99 executors have already stopped.
💡 Tip: A large number of dead executors with many failed tasks is often a red flag for resource or data skew problems.
3. Executor Lifecycle
- Allocated → Databricks requests an executor from the cluster manager (YARN / Standalone / Databricks runtime).
- Active → Executor runs tasks and stores cached data.
- Idle → If no tasks are assigned for a while (dynamic allocation), it may be removed.
- Dead → Removed or terminated.
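The idle → dead transition is governed by Spark's dynamic-allocation settings. A sketch of the relevant keys (values are illustrative; in Databricks these belong in the cluster's Spark config, set before startup):

```
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60s
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 50
```

With these values, an executor that sits idle for 60 seconds is removed and shows up as Dead in the Executors tab.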
4. Common Executor Problems in Databricks
a. Memory Pressure / GC Issues
- Symptom: the driver or executors appear up but unresponsive, typically because of long GC pauses.
- Cause: Executors spending too much time in garbage collection.
- Fix: Increase executor memory or reduce shuffle size.
b. Too Many Shuffle Partitions
- Default: spark.sql.shuffle.partitions = 200
- Too many small partitions → high overhead, more executors created/destroyed.
- Too few → large tasks, slower processing.
- Fix: Tune based on data size:
spark.conf.set("spark.sql.shuffle.partitions", 100)
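A common rule of thumb is to target roughly 128 MB of shuffle data per partition. A hypothetical helper (not a Spark API) sketching that calculation:

```python
def suggest_shuffle_partitions(shuffle_bytes: int, target_mb: int = 128) -> int:
    """Rough partition count so each partition holds ~target_mb of shuffle data."""
    target_bytes = target_mb * 1024 * 1024
    return max(1, -(-shuffle_bytes // target_bytes))  # ceiling division

# A stage shuffling ~50 GB → about 400 partitions
print(suggest_shuffle_partitions(50 * 1024**3))  # → 400
```

You can read the per-stage shuffle size from the Stages tab and feed it into a calculation like this before setting the config.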
c. Data Skew
- One executor handles most of the data while others are idle.
- Fix: Use salting techniques or repartition more evenly.
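The idea behind salting is to split one hot key into several synthetic keys so its rows fan out across partitions. A minimal, Spark-free sketch of the key transformation (names are illustrative):

```python
import random
from collections import Counter

def salted_key(key: str, num_salts: int = 8) -> str:
    """Append a random salt so rows for one hot key spread across partitions."""
    return f"{key}_{random.randrange(num_salts)}"

random.seed(0)
keys = [salted_key("hot_customer") for _ in range(1000)]
# The single hot key now maps to up to 8 distinct keys
print(sorted(Counter(keys)))
```

In Spark itself you would add a salt column (e.g. with rand()) to the skewed side before the join and replicate the small side across all salt values, then drop the salt afterwards.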
d. Too Many Short-lived Executors
- Happens with dynamic allocation when jobs have many small stages.
- Fix: Disable dynamic allocation for short jobs:
spark.conf.set("spark.dynamicAllocation.enabled", "false")
5. Debugging Executor Issues
Step 1: Check Executors Tab
- Look for:
- High Failed Tasks
- High GC Time
- Unbalanced shuffle read/write
- Large difference between Active and Dead counts
Step 2: Drill into Stages
- Go to Stages tab → Find stages with long runtimes.
- Click on them to see task-level details (time, shuffle, spill).
Step 3: Look for GC / Memory Issues
- Executors with very high GC time often need:
- Bigger memory allocation
- Reduced shuffle partitions
- More efficient data formats (e.g., Parquet with predicate pushdown)
Step 4: Adjust Configuration
Some common configs to try:
spark.conf.set("spark.sql.shuffle.partitions", 100)        # Reduce overhead
spark.conf.set("spark.executor.memory", "8g")              # Increase executor memory
spark.conf.set("spark.dynamicAllocation.maxExecutors", 50)
Note: spark.sql.shuffle.partitions can be changed at runtime in a notebook, but spark.executor.memory and the dynamic-allocation settings only take effect when set in the cluster's Spark config before the cluster starts.
6. Best Practices for Executors in Databricks
- Use Delta Lake with ZORDER + OPTIMIZE to reduce shuffle I/O.
- Monitor Shuffle Read/Write — high values often mean expensive joins/groupBy.
- Avoid collect() and other driver-heavy operations on large datasets.
- Use broadcast joins for small tables:
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "id")
- Right-size shuffle partitions based on dataset size:
- Small jobs → 50–100 partitions
- Large jobs → 200–400 partitions
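Spark also broadcasts small tables automatically when they fall under spark.sql.autoBroadcastJoinThreshold (10 MB by default); raising it is an alternative to explicit broadcast() hints. The value below is illustrative and requires a live SparkSession:

```python
# Illustrative: raise the automatic broadcast threshold to 50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```

Be cautious with large thresholds: every broadcast table must fit in driver and executor memory.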
Worked Example: Interpreting Active(9) / Dead(99)
In the Spark Executors tab:
- Active Executors (Active(9)) → 9 executors are currently running and handling tasks.
- Dead Executors (Dead(99)) → 99 executors have already finished their work and shut down (or were removed). They may have completed their tasks successfully or failed at some point.
A large number of dead executors usually means one of two things:
- The job used dynamic allocation (Spark added executors when needed, then removed them when idle).
- Or executors failed due to memory pressure, GC issues, or driver unresponsiveness.
Why it matters: if the same run also shows many failed tasks (629 in this example), the combination is often a symptom of:
- Too many shuffle partitions (overhead → executors die faster).
- Insufficient executor memory (causing OOM or GC stalls).
- Too many short-lived executors from dynamic allocation.

7. When to Increase or Decrease Shuffle Partitions
🔍 Investigation Needed:
- Check the current value:
spark.conf.get("spark.sql.shuffle.partitions")
- Go to Spark UI > Stages and check:
- Number of tasks created per stage
- Skewed partitions (some tasks take way longer)
- Long shuffle read/write times
- High GC (Garbage Collection) time
📌 When to Increase spark.sql.shuffle.partitions:
- Large datasets causing skew in partitions
- Some tasks taking much longer (stragglers)
- Shuffle read/write per task is very high (hundreds of MBs to GBs)
- Long stage duration due to under-parallelism
✅ Impact of Increasing:
- Smaller partitions, more parallelism
- May improve stage completion time
- Slight increase in overhead if partition count is too high
📌 When to Decrease:
- For small to medium data sizes
- Too many partitions (overhead without benefit)
- High task scheduling overhead, high GC from many small tasks
⚠️ Impact of Decreasing:
- Risk of under-parallelism
- Few tasks might do too much work → skew
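A quick way to quantify skew from the task-duration list in the Stages tab is to compare the slowest task against the median (hypothetical helper, not a Spark API):

```python
def straggler_ratio(task_durations_s: list) -> float:
    """Slowest task divided by the median; values well above 1 suggest skew."""
    ordered = sorted(task_durations_s)
    median = ordered[len(ordered) // 2]
    return max(ordered) / median

# Four tasks near 10-12s and one at 95s → ratio ≈ 7.9, a clear straggler
print(round(straggler_ratio([10, 11, 12, 12, 95]), 1))  # → 7.9
```

If the ratio stays high after increasing partitions, the problem is usually key skew rather than under-parallelism, and salting is the better fix.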
8. When to Increase or Decrease Executor Memory
🔍 Investigation Needed:
- Go to Spark UI > Executors
- Look for:
- High memory usage (close to max)
- Task failures with OutOfMemoryError
- Frequent GC (check the GC Time column)
- Shuffle spill to disk (check the spill metrics in stage details)
📌 When to Increase Executor Memory:
- Tasks fail due to OOM errors
- High GC time or memory spills observed
- Caching large datasets in memory
✅ Impact of Increasing:
- Fewer memory-related crashes
- Less GC pressure, better performance
- But: fewer executors fit on a node → less parallelism
📌 When to Decrease Executor Memory:
- Memory is under-utilized
- Want to increase executor count (scale out)
- Running many small jobs → smaller footprint helps concurrency
⚠️ Impact of Decreasing:
- Higher risk of memory errors
- More spills to disk
- But more executors → better concurrency
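The scale-up vs scale-out trade-off can be framed numerically. The helper below is illustrative; actual placement also depends on cores and on the overhead Spark reserves per executor (roughly 10%, via spark.executor.memoryOverhead):

```python
def executors_per_node(node_mem_gb: float, executor_mem_gb: float,
                       overhead_frac: float = 0.10) -> int:
    """How many executors fit on one worker, reserving per-executor overhead."""
    per_executor = executor_mem_gb * (1 + overhead_frac)
    return int(node_mem_gb // per_executor)

print(executors_per_node(64, 8))  # bigger executors → fewer per node (7)
print(executors_per_node(64, 4))  # smaller executors → more per node (14)
```

Halving executor memory here roughly doubles the executor count per worker, which is exactly the concurrency trade-off described above.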
9. When to Disable Dynamic Allocation for Short Jobs
🔍 Investigation Needed:
- Job is very short-lived (a few seconds to 1–2 mins)
- Most of the job's time is spent waiting for executors to be provisioned
- You observe delays like:
- “Pending resources” in Spark UI
- Cluster scaling takes longer than job duration
📌 When to Disable Dynamic Allocation:
- For quick ETL jobs, scheduled frequently
- Not worth spinning up and down executors every time
- Stable executor config preferred (faster start)
✅ Impact of Disabling:
- Predictable performance
- No delay due to executor warm-up
- Slightly more resource usage when idle
⚠️ But:
- Higher cost if idle
- May leave resources unused when the job is not running
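A back-of-the-envelope way to frame the decision: compare provisioning time to job duration. This is an illustrative heuristic with a made-up 20% threshold, not a Spark rule:

```python
def dynamic_allocation_worth_it(job_seconds: float,
                                provision_seconds: float = 30.0) -> bool:
    """Keep dynamic allocation only if startup is a small share of the job."""
    return provision_seconds / job_seconds < 0.2

print(dynamic_allocation_worth_it(60))    # 30s startup vs a 60s job → False
print(dynamic_allocation_worth_it(3600))  # negligible vs a 1h job → True
```

For the short, frequently scheduled ETL jobs described above, the startup share dominates, which is why a fixed executor pool tends to win.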
🔚 Summary Table
| Scenario | Action | Why / When | Impact |
|---|---|---|---|
| High shuffle time per task | Increase shuffle partitions | Large data, uneven partitioning, long stage duration | More tasks, better parallelism, less skew |
| Too many small partitions | Decrease shuffle partitions | Small job/data, high task overhead | Less task overhead, risk of skew |
| OOM or memory spills | Increase executor memory | Large joins, caching, GC issues | Better performance, but fewer executors |
| Low memory use / want scale out | Decrease executor memory | Many small tasks, better concurrency | Higher parallelism, lower memory per task |
| Short frequent jobs | Disable dynamic allocation | Avoid executor provisioning overhead | Faster execution, slightly more cost |
Conclusion
Executors are at the heart of Spark jobs in Databricks. Understanding Active vs Dead executors, monitoring GC time, and tuning partitions and memory can prevent common issues like driver unresponsiveness and task failures.