spark.conf.set("spark.sql.shuffle.partitions", 200)
in Databricks to reduce the number of shuffle partitions used by wide transformations (like groupBy, join, distinct, repartition) so the driver and executors don't get overwhelmed with too many small shuffle tasks.
Where to use
You can set it in three places depending on your use case:
1. Inside a Notebook (Session Level)
- Place it at the start of your notebook so it applies to all queries in that notebook session.
# Set shuffle partitions to 200 (an example value; adjust for your workload)
spark.conf.set("spark.sql.shuffle.partitions", 200)
# Example query
df.groupBy("col1").count().show()
- This only affects the current notebook execution and resets when the notebook restarts.
2. In a Job Notebook or Job Cluster
- Add it at the top of your job’s main notebook so it applies to the whole job run.
- Useful when you know your job's dataset is small and doesn't need the default or cluster-configured 200-2000 partitions.
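For example, a job notebook can derive the partition count from the expected input size instead of hard-coding it. A minimal sketch, assuming a ~128 MB target per shuffle partition (a common rule of thumb; the target size and the helper itself are our own illustration, not a Spark API):

```python
def partitions_for(input_bytes, target_bytes=128 * 1024 * 1024):
    """Suggest a shuffle partition count aiming for ~128 MB per partition."""
    return max(1, -(-input_bytes // target_bytes))  # ceiling division

# ~8 GB of input -> 64 partitions
n = partitions_for(8 * 1024**3)
# spark.conf.set("spark.sql.shuffle.partitions", n)  # first cell of the job notebook
```

Because the value is computed once at the top of the notebook, every wide transformation in the job run picks it up.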
3. Cluster-wide Default (Spark Config in Cluster Settings)
- In Databricks → Compute → Cluster → Edit → Advanced Options → Spark Config:
spark.sql.shuffle.partitions=200
- This sets it for all notebooks and jobs running on that cluster.
- Good for consistent tuning if most workloads are small-to-medium sized.
When to use
- Default in Databricks is often 200 partitions, but some environments may have higher defaults.
- Reduce it if:
- Your dataset is small.
- You see too many small shuffle files in Spark UI.
- Tasks are spending more time in overhead than processing data.
- Increase it if:
- Dataset is huge and you want more parallelism.
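The reduce/increase decision above mostly comes down to average shuffle partition size. A rough diagnostic sketch, with thresholds of our own choosing (not Spark defaults): flag partitions averaging under ~1 MB as too small and over ~200 MB as too large:

```python
def shuffle_diagnosis(shuffle_bytes, num_partitions,
                      min_bytes=1 * 1024**2, max_bytes=200 * 1024**2):
    """Classify a shuffle by its average partition size."""
    avg = shuffle_bytes / num_partitions
    if avg < min_bytes:
        return "reduce partitions"    # tiny tasks: scheduling overhead dominates
    if avg > max_bytes:
        return "increase partitions"  # oversized tasks: risk of spills
    return "ok"

# ~1 GB shuffled across 2000 partitions -> ~500 KB each -> reduce
print(shuffle_diagnosis(1 * 1024**3, 2000))
```

The shuffle bytes and partition counts come straight from the Spark UI stage metrics.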
Example Impact
If you have 10 million rows and spark.sql.shuffle.partitions is set to 2000, Spark creates 2000 tiny shuffle partitions, which can overwhelm the driver with scheduling overhead. Setting it to 200 makes each shuffle partition larger, reducing task overhead and possibly fixing "driver unresponsive" issues.
Before-and-After Spark UI example for using
spark.conf.set("spark.sql.shuffle.partitions", 200)
to fix driver overload issues in Databricks.
Scenario
You run:
df.groupBy("customer_id").agg(F.sum("amount")).display()
Dataset: ~10 million rows, ~1 GB after filtering.
Current setting:
spark.sql.shuffle.partitions = 2000
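The average partition sizes in the Before/After tables below follow from dividing the ~1 GB of shuffled data by the partition count (a quick sanity check, using 1 GiB for round numbers):

```python
total = 1 * 1024**3                     # ~1 GiB shuffled
print(round(total / 2000 / 1024))       # ~524 KB per partition at 2000
print(round(total / 200 / 1024**2, 1))  # ~5.1 MB per partition at 200
```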
Before (High Shuffle Partitions)
Spark UI → Stages
| Metric | Value |
|---|---|
| Shuffle Partitions | 2000 |
| Avg. Partition Size | ~500 KB |
| Tasks per Stage | 2000 |
| Driver CPU | 90-100% |
| GC Time (Driver) | High (frequent) |
| Stage Duration | 6 min 45 sec |
Impact
- Too many tiny shuffle files.
- Huge task scheduling overhead.
- Driver spends more time managing tasks than actual processing.
- "Driver is up but not responsive" risk increases.
After (Reduced Shuffle Partitions)
You set:
spark.conf.set("spark.sql.shuffle.partitions", 200)
Spark UI → Stages
| Metric | Value |
|---|---|
| Shuffle Partitions | 200 |
| Avg. Partition Size | ~5 MB |
| Tasks per Stage | 200 |
| Driver CPU | 50-60% |
| GC Time (Driver) | Low |
| Stage Duration | 2 min 10 sec |
Impact
- Larger shuffle files, fewer tasks.
- Less scheduling overhead.
- Lower driver memory pressure.
- Faster overall runtime and reduced risk of driver GC pause.
Where to Set in Databricks
Notebook Level (Temporary)
spark.conf.set("spark.sql.shuffle.partitions", 200)
Cluster Level (Permanent)
- Databricks → Compute → Cluster → Edit → Advanced Options → Spark Config:
spark.sql.shuffle.partitions=200
You can check the default value of spark.sql.shuffle.partitions in Databricks (or any Spark environment) with a simple command:
In a Databricks Notebook
spark.conf.get("spark.sql.shuffle.partitions")
or
spark.conf.get("spark.sql.shuffle.partitions", None)
Example Output
'200'
This means the effective shuffle partition count is 200 for your session.
In SQL Cell
SET spark.sql.shuffle.partitions;
This will return the current configured value.
From Spark UI
- Go to Spark UI from your Databricks job or cluster.
- Click Environment tab.
- Search for spark.sql.shuffle.partitions in the list of configurations.
- The value displayed there is the active one (could be the default or overridden in cluster settings).
Pro Tip
Pick the number based on cluster size & data volume:
- Small data: 50-200
- Medium data (~100 GB): 200-800
- Large data (>1 TB): 800-2000
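The sizing tiers above can be sketched as a simple lookup (the byte threshold separating "small" from "medium" is an assumption on our part, not something Spark defines):

```python
def suggested_range(data_bytes):
    """Map data volume to the shuffle-partition range suggested above."""
    if data_bytes < 10 * 1024**3:   # < ~10 GB: "small" (assumed cutoff)
        return (50, 200)
    if data_bytes <= 1024**4:       # up to ~1 TB: "medium"
        return (200, 800)
    return (800, 2000)              # > 1 TB: "large"

print(suggested_range(100 * 1024**3))  # (200, 800) for ~100 GB
```

Treat the returned range as a starting point and refine it against the average partition sizes you actually observe in the Spark UI.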