Mohammad Gufran Jahangir August 7, 2025

What is spark.sql.shuffle.partitions?

spark.sql.shuffle.partitions is a Spark SQL configuration parameter that controls the number of output partitions created during shuffling operations, such as:

  • Joins
  • Aggregations (e.g., groupBy)
  • Window functions
  • distinct
  • Sorting

🧠 Why is it important?

Shuffling is one of the most expensive operations in Spark, involving:

  • Reading data across partitions
  • Writing to disk
  • Sending over the network
  • Repartitioning data for processing

So, the number of shuffle partitions has a direct impact on:

Aspect | Impact of Shuffle Partitions
Performance | Too many → task overhead; too few → large partitions (slow)
Memory usage | Too many → memory pressure from tracking a large number of tasks; too few → large partitions that can spill to disk or run out of memory
Parallelism | More partitions → better parallelism if the cluster has enough cores
Disk + GC pressure | High number → more shuffle files, possible GC pressure
Resource utilization | Should be matched to your executor/core settings for efficiency

⚙️ How does it work?

Let's say you do a groupBy or a join:

df.groupBy("country").count()

This triggers a shuffle, and Spark will use the number set in spark.sql.shuffle.partitions to divide the intermediate output.

  • Default is 200 (in most Spark/Databricks setups)
  • Each output partition becomes a task in the next stage

Example:

If set to 200:

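  • Spark will create 200 output partitions from the shuffle stage (with Adaptive Query Execution enabled, Spark may coalesce small ones into fewer partitions at runtime)
  • The next stage will run one task per partition, 200 in total
  • If the cluster has 16 cores, at most 16 tasks run in parallel, so the stage finishes in about 13 waves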
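
The arithmetic in this example can be sketched in plain Python. The helper name `task_waves` is hypothetical, and the model assumes each core runs one task at a time and ignores skew and scheduling overhead, so treat the result as a rough estimate.

```python
import math

def task_waves(num_partitions: int, total_cores: int) -> int:
    """Rough number of scheduling waves needed to run one task per partition."""
    if num_partitions < 0 or total_cores <= 0:
        raise ValueError("num_partitions must be >= 0 and total_cores must be > 0")
    return math.ceil(num_partitions / total_cores)

# 200 shuffle partitions on a 16-core cluster
print(task_waves(200, 16))  # 13 (12 full waves plus a final wave of 8 tasks)
```

A final wave that uses only half the cores is one reason to prefer a partition count that is a multiple of the core count.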

🛠️ When and How to Tune It

Scenario | Recommendation
Small dataset (a few MB to a few GB) | 🔽 Reduce partitions (e.g., spark.conf.set("spark.sql.shuffle.partitions", 10)) to avoid many empty or tiny tasks
Large dataset (100+ GB or TB scale) | 🔼 Increase partitions (e.g., 500–1000) so that no partition is too large
Job has skewed performance | ✅ Tune this alongside salting, repartition, or broadcast join strategies
You see long-running tasks (especially post-shuffle) | ⏳ Could be too few partitions → increase
You see high scheduling delay or too many tasks pending | 🔄 Could be too many partitions → decrease or match with cores
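
For the large-dataset case, one common way to arrive at a number is to divide the estimated shuffle size by a target partition size. The sketch below assumes a target of 128 MB per partition, which is a widely used starting point and not a value taken from this article; the function name and the 100 GB figure are hypothetical inputs you would replace with sizes read from the Spark UI.

```python
import math

MB = 1024 * 1024

def partitions_for_size(shuffle_bytes: int, target_partition_bytes: int = 128 * MB) -> int:
    """Partition count that keeps each shuffle partition near the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Never return fewer than one partition, even for tiny inputs
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

# Roughly 100 GB of shuffled data at about 128 MB per partition
n = partitions_for_size(100 * 1024 * MB)
print(n)  # 800
# In a live session: spark.conf.set("spark.sql.shuffle.partitions", n)
```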

💡 Rule of Thumb

  • Shuffle partitions ≈ 2x to 4x the total number of executor cores
  • For 20 executor cores → set it to 40–80
  • Don't blindly increase to 1000+ unless you're handling very large data
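
The rule of thumb translates directly into a small helper. This is only a sketch: `shuffle_partition_range` is a hypothetical name, and the 2x and 4x multipliers are the ones quoted above, exposed as parameters so they can be adjusted.

```python
from typing import Tuple

def shuffle_partition_range(total_executor_cores: int, low: int = 2, high: int = 4) -> Tuple[int, int]:
    """Suggested (min, max) shuffle partitions for a given total core count."""
    if total_executor_cores <= 0:
        raise ValueError("total_executor_cores must be positive")
    return (total_executor_cores * low, total_executor_cores * high)

print(shuffle_partition_range(20))  # (40, 80)
```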

✅ How to Check and Set It

# Check current value
spark.conf.get("spark.sql.shuffle.partitions")

# Set a new value (effective for current Spark session)
spark.conf.set("spark.sql.shuffle.partitions", 100)

To set permanently for a cluster, define in:

  • Cluster Spark Config
  • Init script
  • Pipeline settings

Example Before/After Optimization

Metric | Before Tuning | After Tuning
Shuffle partitions | 200 | 50
Execution time | 10 min | 4 min
GC time | 2 min | 0.5 min
Failed tasks | 6 | 0

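📦 Related Parameters

  • spark.default.parallelism: Default number of partitions for RDD operations such as join and reduceByKey; it does not set the shuffle partition count for DataFrame/SQL queries
  • spark.sql.files.maxPartitionBytes: Maximum number of bytes packed into a single partition when reading files (128 MB by default), so it controls input partition size before any shuffle
  • spark.dynamicAllocation.*: Scales the number of executors up and down with the workload, which changes how many cores are available to run shuffle tasks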

spark.sql.shuffle.partitions can be defined at multiple levels


🔹 1. Notebook Level (Session Scope) ✅

You can define it within a notebook using the Spark configuration API. This change will apply only to the current Spark session (i.e., until the cluster restarts or notebook ends).

✅ Example:

spark.conf.set("spark.sql.shuffle.partitions", 50)

To verify:

spark.conf.get("spark.sql.shuffle.partitions")
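
Because the setting is session-scoped, a value set for one heavy query stays in effect for everything that runs afterwards in the same notebook. A minimal sketch of one way to limit the override to a single block of work is shown here; the `shuffle_partitions` context manager is a hypothetical helper, not part of PySpark, and it relies only on `spark.conf.get` and `spark.conf.set`.

```python
from contextlib import contextmanager

SHUFFLE_KEY = "spark.sql.shuffle.partitions"

@contextmanager
def shuffle_partitions(spark, n):
    """Temporarily override the shuffle partition count for one block of work."""
    previous = spark.conf.get(SHUFFLE_KEY)
    spark.conf.set(SHUFFLE_KEY, str(n))
    try:
        yield
    finally:
        # Restore the earlier value even if the block raises
        spark.conf.set(SHUFFLE_KEY, previous)

# Usage in a notebook where `spark` and `df` already exist:
# with shuffle_partitions(spark, 50):
#     df.groupBy("country").count().show()
```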

🔹 2. Job/Script Level (e.g., PySpark or Scala jobs)

If you’re submitting a job via a pipeline or job scheduler, you can set this in your job script before any operation that triggers a shuffle.

Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.sql.shuffle.partitions", "100") \
    .getOrCreate()

🔹 3. Cluster Level (Default for All Jobs/Notebooks on the Cluster)

Set this in the cluster configuration to make it the default for all notebooks and jobs running on the cluster.

On Databricks:

  1. Go to Compute → choose your cluster → Configuration
  2. In Spark Config, add the line: spark.sql.shuffle.partitions 200 (one key-value pair per line, key and value separated by a space)
  3. Restart the cluster

🔹 4. Job Cluster / Job Settings (Databricks Jobs)

If you are scheduling a job (via UI or REST API), you can include this config under the spark_conf block in the job definition.

{
  "spark_conf": {
    "spark.sql.shuffle.partitions": "300"
  }
}
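
When the job definition is generated from code, the same fragment can be built as a dictionary and serialized. This is a sketch under assumptions: the helper name is hypothetical, and nesting `spark_conf` under a `new_cluster` key reflects how job cluster specs are commonly laid out, which you should confirm against the Jobs API version you use.

```python
import json

def job_cluster_fragment(shuffle_partitions: int) -> dict:
    """Build the spark_conf part of a job cluster spec; values are sent as strings."""
    return {
        "new_cluster": {
            "spark_conf": {
                "spark.sql.shuffle.partitions": str(shuffle_partitions)
            }
        }
    }

print(json.dumps(job_cluster_fragment(300), indent=2))
```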

🔹 5. Init Script Level (Advanced)

For production-level tuning, you can use init scripts to set this during cluster startup. This ensures every new cluster created from a pool or job template has the setting.


📌 Summary Table

Scope | Where to Set | Applies To
Notebook | spark.conf.set(...) | That notebook only (session)
Script/Code | .config(...) in SparkSession builder | That specific job or script
Cluster | Spark Config section in Databricks UI | All jobs & notebooks on that cluster
Job Definition | spark_conf in job config | That specific scheduled job
Init Script | Startup script for cluster | All clusters using that init script
