Mohammad Gufran Jahangir · August 5, 2025

Use

spark.conf.set("spark.sql.shuffle.partitions", 200)

in Databricks to control the number of shuffle partitions created during wide transformations (like groupBy, join, distinct, repartition), so the driver and executors don’t get overwhelmed with too many small shuffle tasks.


Where to use

You can set it in three places depending on your use case:

1. Inside a Notebook (Session Level)

  • Place it at the start of your notebook so it applies to all queries in that notebook session.
# Set the shuffle partition count for this session (200 is just an example; adjust to your data size)
spark.conf.set("spark.sql.shuffle.partitions", 200)

# Example query
df.groupBy("col1").count().show()
  • This only affects the current notebook execution and resets when the notebook restarts.

2. In a Job Notebook or Job Cluster

  • Add it at the top of your job’s main notebook so it applies to the whole job run.
  • Useful when you know your job’s dataset is smaller and doesn’t need the default 200 partitions (or the higher value your environment may set).

3. Cluster-wide Default (Spark Config in Cluster Settings)

  • In Databricks → Compute → Cluster → Edit → Advanced Options → Spark Config:
spark.sql.shuffle.partitions=200
  • This sets it for all notebooks and jobs running on that cluster.
  • Good for consistent tuning if most workloads are small-to-medium sized.

When to use

  • Default in Databricks is often 200 partitions, but some environments may have higher defaults.
  • Reduce it if:
    • Your dataset is small.
    • You see too many small shuffle files in Spark UI.
    • Tasks are spending more time in overhead than processing data.
  • Increase it if:
    • Dataset is huge and you want more parallelism.
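The bullets above can be turned into a rough sanity check: compare the average shuffle-partition size against a target size and adjust in the direction the gap suggests. The 128 MB target and the 8× thresholds below are assumptions (common rules of thumb, not Spark defaults), and the helper is purely illustrative, not part of any Spark API:

```python
def shuffle_partition_advice(shuffle_bytes, current_partitions, target_mb=128):
    """Suggest a direction for spark.sql.shuffle.partitions.

    shuffle_bytes: approximate size of the data being shuffled.
    target_mb: desired average partition size (assumed rule of thumb).
    Returns "reduce", "increase", or "keep".
    """
    avg_mb = shuffle_bytes / current_partitions / (1024 * 1024)
    if avg_mb < target_mb / 8:    # partitions far too small -> task overhead dominates
        return "reduce"
    if avg_mb > target_mb * 8:    # partitions far too large -> not enough parallelism
        return "increase"
    return "keep"


# ~1 GB shuffled across 2000 partitions -> ~0.5 MB each, far below target
print(shuffle_partition_advice(1 * 1024**3, 2000))  # reduce
```

The thresholds deliberately leave a wide "keep" band so you only retune when partitions are off by nearly an order of magnitude.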

💡 Example Impact
If you have 10 million rows and spark.sql.shuffle.partitions is set to 2000, Spark creates 2000 tiny shuffle files, which can overwhelm the driver. Setting it to 200 makes each shuffle partition larger, reducing task overhead and possibly fixing “driver unresponsive” issues.
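The arithmetic behind that example is easy to verify. Assuming the 10 million rows come to roughly 1 GB of shuffle data (the figure used in the scenario below):

```python
# Rough average shuffle-partition sizes for the ~1 GB example above.
data_bytes = 1 * 1024**3  # ~1 GB of shuffle data (assumed figure)

for partitions in (2000, 200):
    avg_kb = data_bytes / partitions / 1024
    print(f"{partitions} partitions -> ~{avg_kb:,.0f} KB per partition")
# 2000 partitions -> ~524 KB per partition  (tiny files, high task overhead)
# 200 partitions  -> ~5,243 KB per partition (~5 MB, a healthier size)
```

These are exactly the ~500 KB and ~5 MB averages shown in the before/after tables below.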


Here is a before-and-after Spark UI comparison for using

spark.conf.set("spark.sql.shuffle.partitions", 200)

to fix driver overload issues in Databricks.


📊 Scenario

You run:

from pyspark.sql import functions as F

df.groupBy("customer_id").agg(F.sum("amount")).display()

Dataset: ~10 million rows, ~1 GB after filtering.

Default setting:

spark.sql.shuffle.partitions = 2000

🔴 Before (High Shuffle Partitions)

Spark UI β†’ Stages

Metric              | Value
------------------- | ---------------
Shuffle Partitions  | 2000
Avg. Partition Size | ~500 KB
Tasks per Stage     | 2000
Driver CPU          | 90–100%
GC Time (Driver)    | High (frequent)
Stage Duration      | 6 min 45 sec

Impact

  • Too many tiny shuffle files.
  • Huge task scheduling overhead.
  • Driver spends more time managing tasks than actual processing.
  • “Driver is up but not responsive” risk increases.

🟢 After (Reduced Shuffle Partitions)

You set:

spark.conf.set("spark.sql.shuffle.partitions", 200)

Spark UI β†’ Stages

Metric              | Value
------------------- | ---------------
Shuffle Partitions  | 200
Avg. Partition Size | ~5 MB
Tasks per Stage     | 200
Driver CPU          | 50–60%
GC Time (Driver)    | Low
Stage Duration      | 2 min 10 sec

Impact

  • Larger shuffle files, fewer tasks.
  • Less scheduling overhead.
  • Lower driver memory pressure.
  • Faster overall runtime and reduced risk of driver GC pause.

💡 Where to Set in Databricks

Notebook Level (Temporary)

spark.conf.set("spark.sql.shuffle.partitions", 200)

Cluster Level (Permanent)

  • Databricks → Compute → Cluster → Edit → Advanced Options → Spark Config:
spark.sql.shuffle.partitions=200

You can check the default value of spark.sql.shuffle.partitions in Databricks (or any Spark environment) with a simple command:


In a Databricks Notebook

spark.conf.get("spark.sql.shuffle.partitions")

or, with a fallback value so the call doesn’t raise if the key happens to be unset:

spark.conf.get("spark.sql.shuffle.partitions", None)

Example Output

'200'

This means the default shuffle partition count is 200 for your cluster.


In SQL Cell

SET spark.sql.shuffle.partitions;

This will return the current configured value.


From Spark UI

  1. Go to Spark UI from your Databricks job or cluster.
  2. Click Environment tab.
  3. Search for spark.sql.shuffle.partitions in the list of configurations.
  4. The value displayed there is the active one (could be default or overridden in cluster settings).


📌 Pro Tip
Pick the number based on cluster size & data volume:

  • Small data: 50–200
  • Medium data (~100GB): 200–800
  • Large data (>1TB): 800–2000
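The tiers above can be sketched as a small helper. The exact cutovers (100 GB for "small", 1 TB for "medium") are assumptions inferred from the list, and the function is purely illustrative, not a Spark or Databricks API:

```python
def suggest_shuffle_partitions(data_gb):
    """Map data volume (in GB) to a shuffle-partition range from the tiers above.

    Returns a (low, high) tuple; pick within it based on cluster size.
    Cutover points are assumed, not prescribed by Spark.
    """
    if data_gb < 100:       # small data
        return (50, 200)
    if data_gb < 1024:      # medium data (~100 GB)
        return (200, 800)
    return (800, 2000)      # large data (> 1 TB)


print(suggest_shuffle_partitions(1))     # (50, 200)
print(suggest_shuffle_partitions(100))   # (200, 800)
print(suggest_shuffle_partitions(2048))  # (800, 2000)
```

Within a range, lean toward the high end when the cluster has many cores to keep busy, and the low end when tasks are already finishing in well under a second.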
