Introduction
Data skew is one of the most notorious performance killers in Apache Spark jobs. When a handful of tasks handle disproportionately large datasets, your entire pipeline grinds to a halt—even if 99% of your data is evenly distributed. In this guide, we’ll demystify data skew in Spark, explain how to detect it, and share battle-tested strategies to mitigate its impact.
What Is Data Skew?
Data skew occurs when records in a dataset are unevenly distributed across partitions. For example:
- A user_id column where 80% of events belong to a single super-user.
- A country field where 90% of transactions originate from one region.
Skewed partitions cause straggler tasks—tasks that take significantly longer to complete than others—leading to wasted resources and slow job execution.
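A quick way to see the imbalance directly is to count rows per partition (a minimal sketch; events_df is a placeholder for whatever DataFrame you suspect is skewed):
from pyspark.sql.functions import spark_partition_id
# Count rows in each partition; a skewed layout shows a few partitions
# holding far more rows than the rest
events_df.groupBy(spark_partition_id().alias("partition_id")) \
    .count() \
    .orderBy("count", ascending=False) \
    .show()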
Signs of Data Skew
- Spark UI Clues:
- A few tasks in a stage take hours, while others finish in seconds.
- Shuffle spill (data written to disk) in specific tasks.
- GC (Garbage Collection) time spikes in straggler tasks.
- Log Warnings:
- "Skew detected at partition X" entries in Spark logs.
- Performance Symptoms:
- Jobs stuck at 99% completion for extended periods.
Common Causes of Data Skew
- Skewed Keys in Joins
- Joining on a column with highly uneven value distribution (e.g., customer_id with one VIP customer).
- GROUP BY/Aggregations
- Aggregating on a column with a dominant value (e.g., country=USA in global sales data).
- Data Ingestion Issues
- Source systems writing data unevenly (e.g., IoT sensors with one device generating 90% of logs).
How to Fix Data Skew: 6 Proven Strategies
1. Salting (Adding Random Prefixes/Suffixes)
Idea: Distribute skewed keys by appending a random value (e.g., user_id_0, user_id_1).
Steps:
- Add a random salt to the skewed key on the large (skewed) side.
- Replicate the other side once per possible salt value and build the same salted key.
- Perform the join on the salted key.
- Aggregate the results on the original key to remove the salt.
Example:
from pyspark.sql.functions import array, col, concat, explode, lit, rand
# Add a random salt in [0, 9] to the skewed key on the large side
salted_left = left_df.withColumn("salted_key", concat(col("user_id"), lit("_"), (rand() * 10).cast("int")))
# Replicate the other side once per salt value so every salted key finds a match
salted_right = right_df.withColumn("salt", explode(array([lit(i) for i in range(10)]))).withColumn("salted_key", concat(col("user_id"), lit("_"), col("salt")))
# Join on the salted keys
joined_df = salted_left.join(salted_right, "salted_key")
# Remove the salt by aggregating on the original key
result = joined_df.groupBy(salted_left["user_id"]).agg(...)
2. Adaptive Query Execution (AQE)
Spark 3.0+ Feature: Automatically optimizes skewed joins by splitting large partitions.
Enable AQE:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
- AQE detects skew and redistributes data dynamically.
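If the defaults don't catch your skew, two related settings control when AQE treats a shuffle partition as skewed. The values below mirror common defaults and are illustrative, not recommendations; check the documentation for your Spark version:
# A partition is flagged as skewed only if it exceeds both this size threshold
# and skewedPartitionFactor times the median partition size
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")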
3. Broadcast Join for Small Skewed Tables
Idea: Broadcast the smaller table so the join avoids a shuffle entirely; without a shuffle, a skewed key cannot pile up in one partition.
Use Case: When one side of the join is small (e.g., a dimension table with skewed keys).
from pyspark.sql.functions import broadcast
skewed_small_df = spark.read.table("skewed_small_table")
large_df = spark.read.table("large_table")
# Force broadcast
result = large_df.join(broadcast(skewed_small_df), "user_id")
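Spark also broadcasts automatically when it estimates one side of the join to be smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default), so raising that threshold is an alternative to an explicit hint. The value below is only an example and must fit comfortably in driver and executor memory:
# Let Spark auto-broadcast tables estimated at up to ~50 MB (illustrative value)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))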
4. Repartitioning
Idea: Manually rebalance data with repartition(), which performs a full shuffle. Note that coalesce() only merges existing partitions without a shuffle, so it reduces partition count but will not fix skew.
Use Case: Skew caused by initial data partitioning.
# Repartition by a well-distributed column to even out partition sizes
df = df.repartition(100, "balanced_column")
# Or repartition by the join key before a join; note that a skewed key's hot
# values still land in the same partition, so pair this with salting or AQE
df = df.repartition("user_id")
5. Custom Partitioning
Idea: Design a custom partitioner to handle skewed keys.
Example (Scala):
import org.apache.spark.HashPartitioner
class SkewAwarePartitioner(partitions: Int) extends HashPartitioner(partitions) {
  override def getPartition(key: Any): Int = {
    if (key == "skewed_key") 0 // Route the known hot key to its own partition
    else super.getPartition(key) % (partitions - 1) + 1 // Spread all other keys over partitions 1..N-1
  }
}
// partitionBy requires a key-value (pair) RDD
val rdd = data.rdd
  .map(...)
  .partitionBy(new SkewAwarePartitioner(100))
6. Filter and Process Separately
Idea: Isolate skewed data and process it separately.
# Split data into skewed and non-skewed
skewed_data = df.filter("user_id = 'VIP_USER'")
non_skewed_data = df.filter("user_id != 'VIP_USER'")
# Process skewed data with a different strategy
result_skewed = process_skewed(skewed_data)
result_non_skewed = process_normal(non_skewed_data)
# Combine results
final_result = result_skewed.union(result_non_skewed)
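One common way to fill in those placeholder processing functions is to handle the hot key's slice with a broadcast join (the matching rows on the other side are usually small after filtering to the same key) while the remainder uses a regular shuffle join. A sketch, assuming two hypothetical DataFrames left_df and right_df joined on user_id:
from pyspark.sql.functions import broadcast
# The hot key's rows on the right side form a small DataFrame after filtering,
# so that slice of the join can be broadcast while the rest shuffles normally
skewed_left = left_df.filter("user_id = 'VIP_USER'")
normal_left = left_df.filter("user_id != 'VIP_USER'")
skewed_right = right_df.filter("user_id = 'VIP_USER'")
normal_right = right_df.filter("user_id != 'VIP_USER'")
result_skewed = skewed_left.join(broadcast(skewed_right), "user_id")
result_non_skewed = normal_left.join(normal_right, "user_id")
final_result = result_skewed.unionByName(result_non_skewed)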
How to Detect Data Skew
- Spark UI:
- Check the Stages tab for uneven task durations.
- Look for Shuffle Read Size/Records discrepancies.
- Analyze Data Distribution:
df.groupBy("user_id").count().orderBy("count", ascending=False).show()
- Spark Metrics:
- Monitor spark.sql.shuffle.partitions and shuffle spill metrics.
Best Practices to Prevent Skew
- Design Keys Wisely:
- Avoid join/aggregation keys with heavily skewed value distributions (e.g., a column dominated by one value or by nulls); prefer well-distributed keys.
- Preprocess Data:
- Use ETL pipelines to deduplicate or balance skewed keys.
- Leverage Delta Lake:
- Use ZORDER BY on frequently filtered/joined columns (see the sketch after this list).
- Monitor Proactively:
- Set up alerts for long-running tasks or skewed partitions.
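For the Delta Lake item above, the clustering is applied with an OPTIMIZE statement. A minimal sketch, assuming a Delta Lake version with OPTIMIZE/ZORDER support (Delta Lake 2.0+ or Databricks); sales_events and user_id are placeholder names:
# Rewrite the Delta table's files so rows with similar user_id values are co-located,
# improving data skipping for filters and joins on that column
spark.sql("OPTIMIZE sales_events ZORDER BY (user_id)")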
Real-World Example: Fixing a Skewed Aggregation
Scenario: A GROUP BY country job took 6 hours due to 80% of the data coming from the USA.
Steps Taken:
1. Detected skew using df.groupBy("country").count().
2. Salted the country column:
from pyspark.sql.functions import concat, lit, rand, split, sum
salted_df = df.withColumn("salted_country", concat("country", lit("_"), (rand() * 5).cast("int")))
3. Aggregated on salted_country, then summed the results per country:
result = salted_df.groupBy("salted_country").agg(sum("revenue"))
final_result = result.withColumn("country", split("salted_country", "_")[0]).groupBy("country").sum("revenue")
4. Reduced runtime to 45 minutes.
Conclusion
Data skew is a common but surmountable challenge in Spark. By combining techniques like salting, AQE, and strategic repartitioning, you can neutralize skew’s impact and unlock faster, more reliable job performance.