View compute information in the Apache Spark UI
You can view detailed information about Spark jobs by selecting the Spark UI tab on the compute details page.
If you restart a terminated compute, the Spark UI displays information for the restarted compute, not the historical information for the terminated compute.
How to open the Spark UI
- Navigate to the compute details page
- Click the Spark UI tab
How to open the jobs timeline
In the Spark UI, click Jobs, and then Event Timeline, as highlighted in red in the following screenshot. The timeline appears. This example shows the driver and executor 0 being added:
Failing jobs or failing executors
Executor – In Azure Databricks, an executor is a component that does the actual work of processing data. When you run a Spark job, it gets divided into smaller tasks that are distributed across multiple executors. Each executor runs on a separate node in the cluster, handling computation and storing data in memory. This setup allows for efficient parallel processing of large datasets.
A failed job and removed executors are indicated by a red status in the event timeline.
The most common reasons for executors being removed are:
- Autoscaling
- Spot instance losses
- Executors running out of memory
Failing jobs
If you see any failing jobs, click them to go to their pages.
You may see a generic error. Click the link in the description to see if you can get more information:
If you scroll down on this page, you can see why each task failed. In this case, it's clear there's a memory issue:
Failing executors
To find out why your executors are failing, first check the compute's Event log:
- See if there are any events explaining the loss of executors.
- Look for messages indicating that the cluster is resizing or spot instances are being lost.
If you don’t see any information in the event log, navigate back to the Spark UI, then click the Executors tab:
On the Executors tab, you can get the logs from the failed executors:
Gaps in execution
Look for gaps of a minute or more, such as in this example:
- Observe the gaps: Identify and analyze any gaps in the activity timeline (highlighted by red arrows in the example).
- Duration of gaps: Short gaps are normal; they occur while the driver is organizing tasks. Longer gaps may indicate a problem.
- Location and context of gaps: Where the gaps occur is crucial. Gaps during a continuous operation versus at specific points can hint at different causes.
- Timing: Checking when the workload began and ended can help determine whether the gaps align with expected pauses in activity or indicate a problem.
There are a few things that could be happening during the gaps:
- There’s no work to do
- Driver is compiling a complex execution plan
- Execution of non-spark code
- Driver is overloaded
- Cluster is malfunctioning
No work
Having no work to do is the most likely explanation for the gaps. Because the cluster is running and users are submitting queries, gaps are expected; they are the time between query submissions.
Complex execution plan
Using withColumn() repeatedly in a loop creates complex and slow processing plans. To improve efficiency, combine these operations into a single selectExpr() call or translate them into SQL embedded within Python.
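The sketch below illustrates the difference with a hypothetical DataFrame and generated column names; the exact columns and expressions will depend on your workload.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical starting DataFrame with a single numeric column.
df = spark.range(1000).withColumnRenamed("id", "value")

# Slow pattern: each withColumn() call in the loop adds another projection
# to the logical plan, and the driver spends time analyzing the growing plan.
df_slow = df
for i in range(50):
    df_slow = df_slow.withColumn(f"col_{i}", F.col("value") * i)

# Faster pattern: build all the expressions up front and apply them in a
# single selectExpr(), which produces one compact projection.
exprs = ["value"] + [f"value * {i} AS col_{i}" for i in range(50)]
df_fast = df.selectExpr(*exprs)
```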
Execution of non-Spark code
Any execution of code that is not Spark will show up in the timeline as gaps. If the code is using Spark, you will see Spark jobs under the cell:
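As a rough illustration, the following hypothetical notebook cell mixes Spark and non-Spark work; the table name and the driver-only steps are assumptions, but the pattern is what produces timeline gaps.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark work: triggers a Spark job that shows up in the jobs timeline.
# ("events" is a hypothetical table name.)
row_count = spark.table("events").count()

# Non-Spark work: runs only on the driver, so the timeline shows a gap here.
time.sleep(30)                                        # e.g., waiting on an external API
driver_total = sum(i * i for i in range(10_000_000))  # pure-Python computation

# Spark work again: a new job appears in the timeline after the gap.
top_users = spark.table("events").groupBy("user_id").count().limit(10).collect()
```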
Driver is overloaded
View distribution with legacy Ganglia metrics
Look at the Server Load Distribution visualization, which is highlighted here in red:
This visualization has a block of color for each machine in the cluster. Red means heavily loaded, and blue means not loaded at all. The above distribution shows a basically idle cluster. If the driver is overloaded, it would look something like this:
If the Spark driver is overloaded, you have a few options:
- Increase the size of your driver
- Reduce the concurrency
- Spread the load over multiple clusters
Cluster is malfunctioning
You may want to restart the cluster to see if that resolves the issue. You can also look through the logs on the Event log and Driver logs tabs, highlighted in the screenshot below, to see if there's anything suspicious.
Long jobs
In the following example, the workload has one job that's much longer than the others. Long jobs like this are a good target for investigation.
Diagnosing a long stage in Spark
Start by identifying the longest stage of the job. Scroll to the bottom of the job’s page to the list of stages and order them by duration:
Go to the Spark UI > Stages > Completed Stages (or Active Stages) > sort by Duration.
Stage I/O details
To see high-level data about what this stage was doing, look at the Input, Output, Shuffle Read, and Shuffle Write columns:
The columns mean the following:
- Input: How much data this stage read from storage. This could be reading from Delta, Parquet, CSV, etc.
- Output: How much data this stage wrote to storage. This could be writing to Delta, Parquet, CSV, etc.
- Shuffle Read: How much shuffle data this stage read.
- Shuffle Write: How much shuffle data this stage wrote.
Data shuffling in Spark occurs with complex operations like joins and aggregations and can slow down processing by moving data across nodes. To manage or reduce shuffling, you can use strategies like the following (a configuration sketch follows this list):
- Broadcast Hash Join: Avoids shuffling by sending smaller tables directly to all nodes. You can adjust Spark’s settings to increase the size limit of tables automatically broadcasted, or manually specify which tables to broadcast using SQL hints.
- Shuffle Hash Join over Sort-Merge Join: Opt for Shuffle Hash Join, which is generally faster because it skips the sorting step required in Sort-Merge Join. You can set Spark preferences to favor Shuffle Hash Join.
- Leverage Cost-Based Optimizer (CBO): Improves query plans based on statistics of table sizes and column data, which helps in choosing the best join strategy and join order. Ensure CBO is enabled and statistics are updated to maximize its effectiveness.
- Driver and Memory Management: Be cautious with the size of data broadcasted to avoid overloading the driver’s memory, which can lead to errors or performance issues. Adjust driver settings to accommodate larger datasets if necessary.
These optimizations help manage resources more efficiently, potentially reducing runtime and improving the performance of Spark applications.
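The sketch below shows how these strategies might be applied from PySpark; the table names, join key, and threshold value are assumptions for illustration and should be tuned to your own data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Raise the automatic broadcast threshold (here to ~100 MB) so that small
# tables are broadcast instead of shuffled.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

# Prefer shuffle hash join over sort-merge join where Spark can use it.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

# Enable the cost-based optimizer.
spark.conf.set("spark.sql.cbo.enabled", "true")

# Hypothetical tables: `sales` is large, `stores` is small.
sales = spark.table("sales")
stores = spark.table("stores")

# Keep statistics current so the CBO can choose good join strategies and order.
spark.sql("ANALYZE TABLE stores COMPUTE STATISTICS FOR ALL COLUMNS")

# Explicitly broadcast the small table to avoid shuffling the large one.
joined = sales.join(broadcast(stores), "store_id")
```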
Number of tasks
The number of tasks in the long stage can point you in the direction of your issue. You can determine the number of tasks by looking here:
If you see a long-running stage with just one task, that's likely a sign of a problem. While this one task is running, only one CPU is utilized and the rest of the cluster may be idle.
This happens most frequently in the following situations:
- Expensive UDF on small data
- Window function without a PARTITION BY statement (see the sketch after this list)
- Reading from an unsplittable file type. This means the file cannot be read in multiple parts, so you end up with one big task. Gzip is an example of an unsplittable file type.
- Setting the multiLine option when reading a JSON or CSV file
- Schema inference of a large file
- Use of repartition(1) or coalesce(1)
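As an example of the window-function case, the sketch below assumes a hypothetical events table with user_id, event_time, and amount columns; without PARTITION BY, Spark typically logs a warning about moving all data to a single partition.

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.table("events")  # hypothetical table

# One task: with no PARTITION BY, the whole dataset is sorted in a single
# partition, so one long-running task does all the work.
w_global = Window.orderBy("event_time")
slow = events.withColumn("running_total", F.sum("amount").over(w_global))

# Many tasks: partitioning the window by a key spreads the work across
# the cluster.
w_by_user = Window.partitionBy("user_id").orderBy("event_time")
fast = events.withColumn("running_total", F.sum("amount").over(w_by_user))
```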
View more stage details
If the stage has more than one task, you should investigate further. Click on the link in the stage’s description to get more info about the longest stage:
Skew and spill
Spill – The first thing to look for in a long-running stage is whether there's spill.
Spill is what happens when Spark runs low on memory. It starts to move data from memory to disk, and this can be quite expensive. It is most common during data shuffling.
If the stage has some spill, follow the guidance below.
In Databricks, file size tuning involves adjusting the size of files managed by the auto optimize and OPTIMIZE functions, which default to 128 MB and 1 GB, respectively. If these sizes are not ideal for your workload, you can customize the file size by setting the delta.targetFileSize property. For example:
```sql
-- Set the target file size to 128 MB (value is in bytes).
-- delta.targetFileSize is a table property, so set it with ALTER TABLE;
-- "my_table" is a placeholder for your table name.
ALTER TABLE my_table SET TBLPROPERTIES ('delta.targetFileSize' = '134217728');
```
Alternatively, you can automate this process by enabling the delta.tuneFileSizesForRewrites property. When activated, Databricks automatically adjusts file sizes based on your specific activities, like frequent merges, optimizing performance without manual intervention.
```sql
-- Enable automatic file size tuning ("my_table" is a placeholder).
ALTER TABLE my_table SET TBLPROPERTIES ('delta.tuneFileSizesForRewrites' = 'true');
```
Skew
The next thing we want to look into is whether there’s skew. Skew is when one or just a few tasks take much longer than the rest.
Scroll down to the Summary Metrics. The main thing we’re looking for is the Max duration being much higher than the 75th percentile duration. If the Max duration is 50% more than the 75th percentile, you may be suffering from skew.
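If the metrics suggest skew, you can confirm which keys are responsible by counting rows per value of the column the stage joins or groups on. The table and column names below are placeholders.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.table("orders")  # hypothetical table being joined or aggregated

# Count rows per value of the suspected skew key; a handful of keys with far
# larger counts than the rest confirms the skew seen in the task metrics.
(orders
    .groupBy("customer_id")
    .count()
    .orderBy(F.desc("count"))
    .show(10))
```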
No skew or spill
If you don’t see skew or spill, go back to the job page to get an overview of what’s going on. Scroll up to the top of the page and click Associated Job Ids:
If the stage doesn’t have spill or skew, check whether the stage has high I/O.
Spark stage high I/O
Look at the I/O stats of the longest stage again:
What is high I/O?
High I/O occurs when data processing demands exceed about 3 MB per second per CPU core in your system. If your calculations show data transfers around this rate, your system is likely struggling with high input/output operations.
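As a quick back-of-the-envelope check, divide the stage's bytes read (or written) by its duration and by the total worker cores. The numbers below are hypothetical.

```python
# Rough high-I/O check against the ~3 MB/s-per-core rule of thumb.
bytes_read = 40 * 1024**3   # 40 GB from the stage's Input column (hypothetical)
duration_s = 120            # stage duration: 2 minutes (hypothetical)
worker_cores = 32           # total cores across the workers (hypothetical)

mb_per_sec_per_core = bytes_read / duration_s / worker_cores / 1024**2
print(f"{mb_per_sec_per_core:.1f} MB/s per core")  # ~10.7, well above ~3 -> I/O heavy
```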
High input
If you see a lot of input into your stage, that means you’re spending a lot of time reading data.
High output
If you see a lot of output from your stage, that means you’re spending a lot of time writing data.