Mohammad Gufran Jahangir · August 7, 2025

To monitor and troubleshoot long-running sessions in Azure Databricks, it’s essential to understand the execution hierarchy and how to navigate through the Spark UI. Below is a complete guide with explanations, insights, and key areas to monitor.


🔍 How Spark Jobs Execute in Azure Databricks

Hierarchy of Execution:

| Level | Description |
|---|---|
| Notebook | User-triggered execution unit; can contain multiple cells and logic |
| Spark Job | Represents an action (e.g., `collect()`, `show()`, `save()`) being triggered |
| Stage | A set of tasks that can be executed in parallel; one job = multiple stages |
| Task | Smallest unit of execution; assigned to executor slots on worker nodes |

🚦 Step-by-Step: How to Monitor a Long-Running Notebook

🟢 Step 1: Open the Spark UI

  • Go to the notebook.
  • Click the “View” link that appears beside a running Spark job under the cell to drill down into its details in the Spark UI.

📊 Step 2: Analyze Jobs and Stages

| Tab | What to Look For |
|---|---|
| Jobs tab | Duration of each job; look for jobs taking unusually long |
| Stages tab | Check whether any stage is stuck, retrying, or has skewed partitions |
| Tasks per stage | High variance in task durations = skew; look for stages with large input sizes |
| Executors tab | Monitor CPU/memory usage, GC time, and shuffle read/write sizes |
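The “high variance in task durations” check above can be applied offline to task durations exported from the Stages tab. A minimal sketch in plain Python, where the 3× median cutoff is an assumed rule of thumb (not a Spark default):

```python
from statistics import median

def find_skewed_tasks(durations_sec, ratio=3.0):
    """Flag task durations that exceed `ratio` times the stage median,
    a common sign of a skewed partition."""
    med = median(durations_sec)
    return [d for d in durations_sec if d > ratio * med]

# Example: one straggler task in an otherwise even stage
durations = [12, 14, 11, 13, 95, 12]
print(find_skewed_tasks(durations))  # -> [95]
```

If this flags one or two stragglers per stage, inspect those tasks' input partitions; if it flags nothing but the stage is still slow, the problem is more likely resource-side (executors, GC) than skew.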

🔧 Common Causes of Long-Running Jobs & Fixes

| Issue | Where to Look | Fix/Strategy |
|---|---|---|
| Skewed joins | Stages → task duration | Use salting or broadcast joins |
| Too many partitions | `spark.sql.shuffle.partitions` | Decrease shuffle partitions |
| Insufficient executors | Executors tab | Enable dynamic allocation or increase executors |
| High GC time | Executors → GC Time | Increase executor memory or tune garbage collection |
| `collect()` on large data | DAG & SQL tab | Avoid `collect()` on large datasets |
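The salting fix for skewed joins can be sketched in plain Python (outside Spark) to show the idea: a random suffix spreads one hot key across several shuffle buckets, while the other side of the join replicates each key once per salt value so matches are still found. `NUM_SALTS` and the helper names here are illustrative, not a Spark API:

```python
import random

NUM_SALTS = 4  # number of salt values; tune to the degree of skew

def salt_key(key, num_salts=NUM_SALTS):
    """Append a random salt so one hot key maps to several shuffle buckets."""
    return f"{key}_{random.randrange(num_salts)}"

def explode_key(key, num_salts=NUM_SALTS):
    """On the other join side, replicate each key once per salt value
    so salted keys still find their match."""
    return [f"{key}_{i}" for i in range(num_salts)]

# A hot key like "US" now lands in up to NUM_SALTS buckets instead of one
salted = {salt_key("US") for _ in range(1000)}
print(sorted(salted))  # e.g. ['US_0', 'US_1', 'US_2', 'US_3']
```

In Spark you would apply the same transformation to the join columns of both DataFrames before joining, then drop the salt afterwards.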

📌 Deep Dive: Tools and Values to Track

| Metric | Meaning | Thresholds/Insights |
|---|---|---|
| Shuffle read/write | Data shuffled between stages | High shuffle = poor partitioning or a bad join strategy |
| GC time | Time spent in garbage collection | >10% of task time = memory tuning required |
| Executor logs (stdout/stderr) | Actual logs/errors | Use for debugging job-specific issues |
| DAG visualization | Flow of job execution (narrow vs. wide transformations) | Wide transformations mark shuffle boundaries |
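The >10% GC threshold above is easy to check against the "GC Time" and "Task Time" columns of the Executors tab. A small helper to make the rule explicit (the 10% threshold is the article's heuristic, not a hard Spark limit):

```python
def gc_pressure(task_time_ms, gc_time_ms, threshold=0.10):
    """Return True when GC takes more than `threshold` of total task time,
    the point at which memory tuning is worth investigating."""
    return gc_time_ms / task_time_ms > threshold

print(gc_pressure(task_time_ms=60_000, gc_time_ms=9_000))  # 15% -> True
print(gc_pressure(task_time_ms=60_000, gc_time_ms=3_000))  # 5%  -> False
```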

🧠 What Happens During collect()?

  • collect() pulls the entire dataset from the executors into the driver.
  • ⚠️ Dangerous for large datasets: it can crash the driver or cause OutOfMemory errors.

Use it only for small datasets, or switch to something like:

df.limit(1000).toPandas()  # safer: brings only a small sample to the driver

📂 Common Optimization Techniques

| Strategy | When to Use |
|---|---|
| Broadcast join | One table is small (a few MB) |
| Salting | Data is skewed on a join key |
| Repartitioning | Uneven partitions or high shuffle detected |
| Caching | A DataFrame is reused across multiple stages |
| Dynamic allocation | Executor needs vary over the job's lifetime |
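To make the broadcast-join idea concrete, here is a plain-Python analogue (not the Spark API): the small table becomes an in-memory hash map available to every task, so the large side is never shuffled. In PySpark the equivalent is the `broadcast()` hint from `pyspark.sql.functions`:

```python
def broadcast_join(large_rows, small_table, key):
    """Hash-join a large row set against a small in-memory table.
    This mirrors what Spark's broadcast join does: ship the small
    table to every executor and avoid shuffling the large side."""
    lookup = {row[key]: row for row in small_table}  # the "broadcast" copy
    joined = []
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined

facts = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
dims = [{"id": 1, "name": "widget"}, {"id": 2, "name": "gadget"}]
print(broadcast_join(facts, dims, "id"))
```

The trade-off is memory: the small table must fit comfortably on every executor, which is why this strategy is reserved for small dimension tables.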

✅ Spark Monitoring Checklist

| Area | Check What? | Tool/UI Tab | Fix/Action |
|---|---|---|---|
| Job duration | Long-running jobs in the Spark UI | Jobs tab | Break logic into smaller actions |
| Stage delay | Stages stuck or retried | Stages tab | Repartition or fix the logic causing retries |
| Task skew | Uneven task execution times | Stages → Tasks | Apply salting or broadcast joins |
| GC time | GC time >10% of task time | Executors tab | Increase executor memory or tune GC |
| Shuffle read/write | High shuffle size or imbalance | Stages tab | Avoid wide transformations; optimize joins |
| Broadcast join | Is it used for small dimension tables? | SQL tab / DAG | Enable the broadcast hint |
| Executor memory | Memory pressure or OOM | Executors tab | Increase memory or reduce `collect()` usage |
| Dynamic allocation | Is it enabled, with limits tuned? | Environment tab | Enable it and tune min/max executors |
| Repartitioning | Is partitioning optimal? | Spark config & stage DAG | Use `coalesce`/`repartition` with balance |
| DAG visualization | Visual flow of transformations | Jobs → DAG | Debug performance bottlenecks |

💬 Conclusion

When investigating a long-running Databricks notebook, always start from the Job panel → drill down into Stages → Tasks → Executors. Keep an eye on GC time, shuffle behavior, and data movement. Visual aids like DAG visualization and thread dumps help you uncover skew, serialization issues, and resource bottlenecks.
