spark.conf.set("spark.sql.shuffle.partitions", 200)
in Databricks to reduce the number of shuffle partitions used by wide transformations (like groupBy, join, distinct, repartition) so the driver and executors don't get overwhelmed with too many small shuffle tasks.
Where to use
You can set it in three places depending on your use case:
1. Inside a Notebook (Session Level)
- Place it at the start of your notebook so it applies to all queries in that notebook session.
# Set shuffle partitions to 200 (an example value; adjust for your workload)
spark.conf.set("spark.sql.shuffle.partitions", 200)
# Example query
df.groupBy("col1").count().show()
- This only affects the current notebook execution and resets when the notebook restarts.
2. In a Job Notebook or Job Cluster
- Add it at the top of your job’s main notebook so it applies to the whole job run.
- Useful when you know your job's dataset is small and doesn't need the default or cluster-configured 200-2000 partitions.
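For example, a job notebook can derive the partition count from the expected input size instead of hard-coding it. A minimal sketch, assuming a ~128 MB target per shuffle partition (a common rule of thumb; the target size and the helper itself are our own illustration, not a Spark API):

```python
def partitions_for(input_bytes, target_bytes=128 * 1024 * 1024):
    """Suggest a shuffle partition count aiming for ~128 MB per partition."""
    return max(1, -(-input_bytes // target_bytes))  # ceiling division

# ~8 GB of input -> 64 partitions
n = partitions_for(8 * 1024**3)
# spark.conf.set("spark.sql.shuffle.partitions", n)  # first cell of the job notebook
```

Because the value is computed once at the top of the notebook, every wide transformation in the job run picks it up.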
3. Cluster-wide Default (Spark Config in Cluster Settings)
- In Databricks → Compute → Cluster → Edit → Advanced Options → Spark Config:
spark.sql.shuffle.partitions=200
- This sets it for all notebooks and jobs running on that cluster.
- Good for consistent tuning if most workloads are small-to-medium sized.
When to use
- Default in Databricks is often 200 partitions, but some environments may have higher defaults.
- Reduce it if:
- Your dataset is small.
- You see too many small shuffle files in Spark UI.
- Tasks are spending more time in overhead than processing data.
- Increase it if:
- Dataset is huge and you want more parallelism.
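The reduce/increase decision above mostly comes down to average shuffle partition size. A rough diagnostic sketch, with thresholds of our own choosing (not Spark defaults): flag partitions averaging under ~1 MB as too small and over ~200 MB as too large:

```python
def shuffle_diagnosis(shuffle_bytes, num_partitions,
                      min_bytes=1 * 1024**2, max_bytes=200 * 1024**2):
    """Classify a shuffle by its average partition size."""
    avg = shuffle_bytes / num_partitions
    if avg < min_bytes:
        return "reduce partitions"    # tiny tasks: scheduling overhead dominates
    if avg > max_bytes:
        return "increase partitions"  # oversized tasks: risk of spills
    return "ok"

# ~1 GB shuffled across 2000 partitions -> ~500 KB each -> reduce
print(shuffle_diagnosis(1 * 1024**3, 2000))
```

The shuffle bytes and partition counts come straight from the Spark UI stage metrics.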
Example Impact
If you have 10 million rows and spark.sql.shuffle.partitions is set to 2000, Spark creates 2000 tiny shuffle partitions, which can overwhelm the driver with scheduling overhead. Setting it to 200 makes each shuffle partition larger, reducing task overhead and possibly fixing "driver unresponsive" issues.
Before-and-After Spark UI example for using
spark.conf.set("spark.sql.shuffle.partitions", 200)
to fix driver overload issues in Databricks.
Scenario
You run:
df.groupBy("customer_id").agg(F.sum("amount")).display()
Dataset: ~10 million rows, ~1 GB after filtering.
Current setting:
spark.sql.shuffle.partitions = 2000
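The average partition sizes in the Before/After tables below follow from dividing the ~1 GB of shuffled data by the partition count (a quick sanity check, using 1 GiB for round numbers):

```python
total = 1 * 1024**3                     # ~1 GiB shuffled
print(round(total / 2000 / 1024))       # ~524 KB per partition at 2000
print(round(total / 200 / 1024**2, 1))  # ~5.1 MB per partition at 200
```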
Before (High Shuffle Partitions)
Spark UI → Stages
| Metric | Value |
|---|---|
| Shuffle Partitions | 2000 |
| Avg. Partition Size | ~500 KB |
| Tasks per Stage | 2000 |
| Driver CPU | 90-100% |
| GC Time (Driver) | High (frequent) |
| Stage Duration | 6 min 45 sec |
Impact
- Too many tiny shuffle files.
- Huge task scheduling overhead.
- Driver spends more time managing tasks than actual processing.
- "Driver is up but not responsive" risk increases.
After (Reduced Shuffle Partitions)
You set:
spark.conf.set("spark.sql.shuffle.partitions", 200)
Spark UI → Stages
| Metric | Value |
|---|---|
| Shuffle Partitions | 200 |
| Avg. Partition Size | ~5 MB |
| Tasks per Stage | 200 |
| Driver CPU | 50-60% |
| GC Time (Driver) | Low |
| Stage Duration | 2 min 10 sec |
Impact
- Larger shuffle files, fewer tasks.
- Less scheduling overhead.
- Lower driver memory pressure.
- Faster overall runtime and reduced risk of driver GC pause.
Where to Set in Databricks
Notebook Level (Temporary)
spark.conf.set("spark.sql.shuffle.partitions", 200)
Cluster Level (Permanent)
- Databricks → Compute → Cluster → Edit → Advanced Options → Spark Config:
spark.sql.shuffle.partitions=200
You can check the default value of spark.sql.shuffle.partitions in Databricks (or any Spark environment) with a simple command:
In a Databricks Notebook
spark.conf.get("spark.sql.shuffle.partitions")
or
spark.conf.get("spark.sql.shuffle.partitions", None)
Example Output
'200'
This means the effective shuffle partition count is 200 for your session.
In SQL Cell
SET spark.sql.shuffle.partitions;
This will return the current configured value.
From Spark UI
- Go to Spark UI from your Databricks job or cluster.
- Click Environment tab.
- Search for spark.sql.shuffle.partitions in the list of configurations.
- The value displayed there is the active one (could be the default or overridden in cluster settings).
Pro Tip
Pick the number based on cluster size & data volume:
- Small data: 50-200
- Medium data (~100 GB): 200-800
- Large data (>1 TB): 800-2000
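The sizing tiers above can be sketched as a simple lookup (the byte threshold separating "small" from "medium" is an assumption on our part, not something Spark defines):

```python
def suggested_range(data_bytes):
    """Map data volume to the shuffle-partition range suggested above."""
    if data_bytes < 10 * 1024**3:   # < ~10 GB: "small" (assumed cutoff)
        return (50, 200)
    if data_bytes <= 1024**4:       # up to ~1 TB: "medium"
        return (200, 800)
    return (800, 2000)              # > 1 TB: "large"

print(suggested_range(100 * 1024**3))  # (200, 800) for ~100 GB
```

Treat the returned range as a starting point and refine it against the average partition sizes you actually observe in the Spark UI.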