Mohammad Gufran Jahangir · August 7, 2025

πŸ” What is spark.sql.shuffle.partitions?

spark.sql.shuffle.partitions is a Spark SQL configuration parameter that controls the number of output partitions created during shuffling operations, such as:

  • Joins
  • Aggregations (e.g., groupBy)
  • Window functions
  • distinct
  • Sorting

🧠 Why is it important?

Shuffling is one of the most expensive operations in Spark, involving:

  • Reading data across partitions
  • Writing to disk
  • Sending over the network
  • Repartitioning data for processing

So, the number of shuffle partitions has a direct impact on:

| Aspect | Impact of shuffle partitions |
| --- | --- |
| Performance | Too many → task overhead; too few → large, slow partitions |
| Memory usage | Too many → memory pressure due to task proliferation |
| Parallelism | More partitions → better parallelism, if the cluster has enough cores |
| Disk + GC pressure | High counts → more shuffle files, possible GC pressure |
| Resource utilization | Should match your executor/core settings for efficiency |

βš™οΈ How does it work?

Let’s say you do a groupBy or a join:

df.groupBy("country").count()

This triggers a shuffle, and Spark will use the number set in spark.sql.shuffle.partitions to divide the intermediate output.

  • Default is 200 (in most Spark/Databricks setups); with Adaptive Query Execution enabled (Spark 3.x), Spark can coalesce small shuffle partitions at runtime, using this value as the starting point
  • Each output partition becomes a task in the next stage

Example:

If set to 200:

  • Spark will create 200 output partitions from the shuffle stage
  • Next stage will process these 200 tasks
  • If the cluster has 16 cores, only 16 of those tasks run in parallel at a time
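The arithmetic in this example can be sketched in plain Python (no Spark required); the figures (200 partitions, 16 cores) come straight from the bullets above:

```python
import math

def shuffle_waves(shuffle_partitions: int, cluster_cores: int) -> int:
    """Number of 'waves' of tasks needed to finish the post-shuffle stage,
    assuming one task per shuffle partition and one task per core at a time."""
    return math.ceil(shuffle_partitions / cluster_cores)

# 200 shuffle partitions on a 16-core cluster:
# 16 tasks run in parallel, so the stage takes ceil(200 / 16) = 13 waves.
print(shuffle_waves(200, 16))  # 13
```

This is why the partition count and the core count should be considered together: partitions that don't divide evenly into the core count leave cores idle in the final wave.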

πŸ› οΈ When and How to Tune It

| Scenario | Recommendation |
| --- | --- |
| Small dataset (a few MB/GB) | 🔽 Reduce partitions (e.g., spark.conf.set("spark.sql.shuffle.partitions", 10)) to avoid many empty/small tasks |
| Large dataset (100+ GB or TB scale) | 🔼 Increase partitions (e.g., 500–1000) so no single partition is too large |
| Job has skewed performance | ✅ Tune this alongside salting, repartitioning, or broadcast-join strategies |
| Long-running tasks (especially post-shuffle) | ⏳ Could be too few partitions → increase |
| High scheduling delay or many pending tasks | 🔄 Could be too many partitions → decrease or match core count |
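One way to act on these recommendations is to size shuffle partitions from the expected shuffle data volume, targeting roughly 128–200 MB per partition. A minimal pure-Python sketch; the helper name and the 128 MB target are illustrative assumptions, not a Spark API:

```python
def suggest_partitions_by_size(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Estimate a shuffle partition count so each partition lands near the
    target size (128 MB by default, an assumed rule-of-thumb value)."""
    return max(1, -(-shuffle_bytes // target_partition_bytes))  # ceiling division

# ~50 GB of shuffle data at ~128 MB per partition -> 400 partitions
print(suggest_partitions_by_size(50 * 1024**3))  # 400
```

In practice you would read the actual shuffle write size from the Spark UI for the stage in question, then feed it into an estimate like this.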

πŸ’‘ Rule of Thumb

  • Shuffle partitions β‰ˆ 2x to 4x the number of executor cores
  • For 20 executor cores β†’ set it to 40–80
  • Don’t blindly increase to 1000+ unless you’re handling very large data
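The rule of thumb translates directly into a small helper (plain Python; the function name is illustrative):

```python
def suggest_partitions_by_cores(executor_cores: int, factor: int = 2) -> int:
    """Rule of thumb from above: shuffle partitions ~ 2x-4x total executor cores."""
    if not 2 <= factor <= 4:
        raise ValueError("factor should be between 2 and 4 per the rule of thumb")
    return executor_cores * factor

# 20 executor cores -> 40 (factor=2) up to 80 (factor=4)
print(suggest_partitions_by_cores(20, 2), suggest_partitions_by_cores(20, 4))  # 40 80
```

A multiple of the core count keeps every wave of tasks full, while the 2x–4x headroom evens out stragglers caused by mild skew.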

βœ… How to Check and Set It

# Check current value
spark.conf.get("spark.sql.shuffle.partitions")

# Set a new value (effective for current Spark session)
spark.conf.set("spark.sql.shuffle.partitions", 100)

To set permanently for a cluster, define in:

  • Cluster Spark Config
  • Init script
  • Pipeline settings

πŸ” Example Before/After Optimization

| Metric | Before tuning | After tuning |
| --- | --- | --- |
| Shuffle partitions | 200 | 50 |
| Execution time | 10 min | 4 min |
| GC time | 2 min | 0.5 min |
| Failed tasks | 6 | 0 |

πŸ“¦ Related Parameters

  • spark.default.parallelism: Default partition count for RDD operations (the RDD-side counterpart of this setting)
  • spark.sql.files.maxPartitionBytes: Controls the maximum size of input partitions when reading files (the read side, not the shuffle)
  • spark.dynamicAllocation.*: Can help tune executor count
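To see how spark.sql.files.maxPartitionBytes shapes read-side partitioning (as opposed to shuffle partitioning), here is a rough pure-Python estimate; the real planner also accounts for file boundaries and spark.sql.files.openCostInBytes, so treat this as an approximation:

```python
def estimate_input_partitions(total_input_bytes: int,
                              max_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough estimate of read-side partitions: total input divided by
    maxPartitionBytes (128 MB is the documented default). Shuffle partitions
    are controlled separately by spark.sql.shuffle.partitions."""
    return max(1, -(-total_input_bytes // max_partition_bytes))  # ceiling division

# A ~10 GB scan with the 128 MB default -> ~80 input partitions
print(estimate_input_partitions(10 * 1024**3))  # 80
```

The key distinction: this setting governs how many tasks read the data in, while spark.sql.shuffle.partitions governs how many tasks exist after the first shuffle.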

spark.sql.shuffle.partitions can be defined at multiple levels:


πŸ”Ή 1. Notebook Level (Session Scope) βœ…

You can define it within a notebook using the Spark configuration API. This change will apply only to the current Spark session (i.e., until the cluster restarts or notebook ends).

βœ… Example:

spark.conf.set("spark.sql.shuffle.partitions", 50)

πŸ” To verify:

spark.conf.get("spark.sql.shuffle.partitions")

πŸ”Ή 2. Job/Script Level (e.g., PySpark or Scala jobs)

If you’re submitting a job via a pipeline or job scheduler, you can set this in your job script before any operation that triggers a shuffle.

Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()

πŸ”Ή 3. Cluster Level (Default for All Jobs/Notebooks on the Cluster)

Set this in the cluster configuration to make it the default for all notebooks and jobs running on the cluster.

On Databricks:

  1. Go to Compute β†’ choose your cluster β†’ Configuration
  2. In Spark Config, add: spark.sql.shuffle.partitions=200
  3. Restart the cluster

πŸ”Ή 4. Job Cluster / Job Settings (Databricks Jobs)

If you are scheduling a job (via UI or REST API), you can include this config under the spark_conf block in the job definition.

{
  "spark_conf": {
    "spark.sql.shuffle.partitions": "300"
  }
}

πŸ”Ή 5. Init Script Level (Advanced)

For production-level tuning, you can use init scripts to set this during cluster startup. This ensures every new cluster created from a pool or job template has the setting.


πŸ“Œ Summary Table

| Scope | Where to set | Applies to |
| --- | --- | --- |
| Notebook | spark.conf.set(...) | That notebook only (session) |
| Script/Code | .config(...) in the SparkSession builder | That specific job or script |
| Cluster | Spark Config section in the Databricks UI | All jobs & notebooks on that cluster |
| Job definition | spark_conf in the job config | That specific scheduled job |
| Init script | Startup script for the cluster | All clusters using that init script |
