To monitor and troubleshoot long-running sessions in Azure Databricks, it’s essential to understand the execution hierarchy and how to navigate through the Spark UI. Below is a complete guide with explanations, insights, and key areas to monitor.
| Level | Description |
|---|---|
| Notebook | User-triggered execution unit; can contain multiple cells and logic |
| Spark Job | Represents an action (e.g., `collect()`, `show()`, `save()`) triggered on a DataFrame |
| Stage | A set of tasks that can be executed in parallel; one job = multiple stages |
| Task | Smallest unit of execution; assigned to executor slots on worker nodes |
🚦 Step-by-Step: How to Monitor a Long-Running Notebook
🟢 Step 1: Open the Spark UI
Open the running notebook.
Click the “View” link beside a running Spark job (shown below the executing cell) to drill down into the Spark UI.
📊 Step 2: Analyze Jobs and Stages
| Tab | What to Look For |
|---|---|
| Jobs | Duration of each job; look for jobs taking unusually long |
| Stages | Check whether any stage is stuck, retrying, or has skewed partitions |
| Tasks per Stage | High variance in task durations = skew; look for stages with large input sizes |
| Executors | Monitor CPU/memory usage, GC time, and shuffle read/write sizes |
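The “high variance in task durations = skew” heuristic above can be encoded programmatically. A minimal sketch in plain Python (task durations would come from the Stages tab or Spark’s REST API; the numbers below are made up for illustration):

```python
from statistics import median

def looks_skewed(task_durations_ms, ratio=3.0):
    """Flag a stage as skewed when its slowest task takes far longer
    than the typical (median) task -- a common rule of thumb."""
    return max(task_durations_ms) > ratio * median(task_durations_ms)

# A balanced stage vs. one with a straggler task
balanced = [980, 1010, 995, 1002, 990]
skewed   = [950, 1000, 980, 1020, 9500]  # one task does ~10x the work

print(looks_skewed(balanced))  # False
print(looks_skewed(skewed))    # True
```

The `ratio=3.0` cutoff is an assumption, not a Spark default; pick a threshold that matches your workload.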
🔧 Common Causes of Long-Running Jobs & Fixes
| Issue | Where to Look | Fix/Strategy |
|---|---|---|
| Skewed joins | Stages → task duration | Use salting or broadcast joins |
| Too many partitions | `spark.sql.shuffle.partitions` | Decrease shuffle partitions |
| Insufficient executors | Executors tab | Enable dynamic allocation or increase executors |
| High GC time | Executors → GC Time | Increase executor memory or tune garbage collection |
| `collect()` on large data | DAG & SQL tab | Avoid `collect()` on large datasets |
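Several of the fixes above are plain configuration changes. A hedged sketch (the values are illustrative starting points, not recommendations; `spark` is the active SparkSession in a Databricks notebook):

```python
# Runtime-tunable SQL options (safe to set from a notebook):
spark.conf.set("spark.sql.shuffle.partitions", "200")  # lower if tasks are tiny, raise if huge
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # broadcast tables under ~50 MB

# Cluster-level setting -- goes in the cluster's Spark config, not the notebook:
# spark.dynamicAllocation.enabled true
```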
📌 Deep Dive: Tools and Values to Track
| Metric | Meaning | Thresholds/Insights |
|---|---|---|
| Shuffle Read/Write | Data shuffled between stages | High shuffle = poor partitioning or a bad join strategy |
| GC Time | Time spent in garbage collection | >10% of task time = memory tuning required |
| Executor Logs (stdout/stderr) | Actual logs and errors | Use for debugging job-specific issues |
| DAG Visualization | Flow of job execution (narrow vs. wide transformations) | Wide transformations = shuffle boundaries |
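The “>10% of task time” GC rule of thumb from the table is easy to encode. A minimal sketch (the metric values would be read from the Executors tab; these numbers are fabricated):

```python
def gc_pressure(task_time_ms, gc_time_ms, threshold=0.10):
    """Return True when GC consumes more than `threshold` of task time,
    suggesting executor memory needs tuning."""
    return gc_time_ms / task_time_ms > threshold

print(gc_pressure(task_time_ms=60_000, gc_time_ms=9_000))  # True  (15% of task time)
print(gc_pressure(task_time_ms=60_000, gc_time_ms=3_000))  # False (5% of task time)
```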
🧠 What Happens During collect()?
`collect()` pulls the entire dataset from the executors into the driver.

⚠️ Dangerous for large datasets: it can crash the driver or cause OutOfMemory errors.

Use it only for small datasets, or switch to a bounded alternative:

```python
df.limit(1000).toPandas()  # safer for sampling: caps the rows pulled to the driver
```
📂 Common Optimization Techniques
| Strategy | When to Use |
|---|---|
| Broadcast Join | When one table is small (a few MB) |
| Salting | When data is skewed on a join key |
| Repartitioning | When uneven partitions or high shuffle is detected |
| Caching | When a DataFrame is reused across multiple stages |
| Dynamic Allocation | When the job’s executor needs vary over time |
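Salting from the table above can be illustrated without a cluster: append a random suffix to the hot key so its rows spread across many partitions instead of one. A conceptual pure-Python sketch (in Spark you would add a salt column with `rand()` and join on the salted key; `N_SALTS` is an arbitrary choice here):

```python
import random

N_SALTS = 8

def salt(key: str) -> str:
    # Spread a single hot key across N_SALTS synthetic keys
    return f"{key}_{random.randrange(N_SALTS)}"

# A skewed join key: "hot" dominates the dataset
keys = ["hot"] * 1_000 + ["cold"] * 10
salted = [salt(k) for k in keys]

hot_buckets = {s for s in salted if s.startswith("hot_")}
print(len(hot_buckets))  # up to 8 distinct keys instead of 1
```

Note that the smaller side of the join must be exploded with all `N_SALTS` variants of each key so matches still line up after salting.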
Spark Monitoring Checklist
| Area | Check What? | Tool/UI Tab | Fix/Action |
|---|---|---|---|
| Job Duration | Long-running jobs in Spark UI | Jobs tab | Break logic into smaller actions |
| Stage Delay | Stages stuck or retried | Stages tab | Repartition or fix logic causing retries |
| Task Skew | Uneven task execution times | Stages → Tasks | Apply salting or broadcast joins |
| GC Time | GC time >10% of task time | Executors tab | Increase executor memory or tune GC |
| Shuffle Read/Write | High shuffle size or imbalance | Stages tab | Avoid wide transformations; optimize joins |
| Broadcast Join | Used for small dimension tables? | SQL tab / DAG | Enable broadcast hint |
| Executor Memory | Memory pressure or OOM | Executors tab | Increase memory or avoid `collect()` on large data |
| Dynamic Allocation | Is it enabled, with limits tuned? | Environment tab | Enable and tune min/max executors |
| Repartitioning | Is partitioning optimal? | Spark config & stage DAG | Use `coalesce()`/`repartition()` with balanced counts |
| DAG Visualization | Visual flow of transformations | Jobs → DAG | Debug performance bottlenecks |
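The checklist can also be driven programmatically: Spark’s UI exposes a monitoring REST API (e.g., `GET /api/v1/applications/<app-id>/jobs`). A hedged sketch that flags long-running jobs from such a response — the sample payload below is fabricated for illustration, and the 300-second threshold is an arbitrary choice:

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # Spark's REST API emits timestamps like "2024-01-01T10:00:00.000GMT"
    return datetime.strptime(ts.removesuffix("GMT"), "%Y-%m-%dT%H:%M:%S.%f")

def long_running(jobs, threshold_s=300):
    """Return the IDs of completed jobs that ran longer than threshold_s."""
    return [
        j["jobId"]
        for j in jobs
        if (parse_ts(j["completionTime"]) - parse_ts(j["submissionTime"])).total_seconds() > threshold_s
    ]

jobs = [
    {"jobId": 0, "submissionTime": "2024-01-01T10:00:00.000GMT",
     "completionTime": "2024-01-01T10:02:00.000GMT"},   # 120 s
    {"jobId": 1, "submissionTime": "2024-01-01T10:00:00.000GMT",
     "completionTime": "2024-01-01T10:20:00.000GMT"},   # 1200 s
]
print(long_running(jobs))  # [1]
```

In practice you would fetch the job list with an HTTP client from the driver’s UI endpoint rather than hard-coding it.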
💬 Conclusion
When investigating a long-running Databricks notebook, always start from the Job panel → drill down into Stages → Tasks → Executors. Keep an eye on GC time, shuffle behavior, and data movement. Visual aids like DAG visualization and thread dumps help you uncover skew, serialization issues, and resource bottlenecks.