Mohammad Gufran Jahangir · September 28, 2025

How to set min/max nodes, use termination settings, mix spot/preemptible nodes, and avoid “yo-yo” scaling


Why Autoscaling Is Tricky

Autoscaling is one of Databricks’ most powerful features—but many teams misuse it.

  • Set it wrong, and you pay for idle nodes.
  • Tune it poorly, and jobs yo-yo between scaling up and down, wasting both time and money.

This blog breaks down how to configure autoscaling in a way that balances cost and performance, giving you control instead of surprises.


Step 1: Get the Basics Right – Min & Max Nodes

When you enable autoscaling, you must define:

  • Min Nodes: the minimum workers that always run.
  • Max Nodes: the upper limit that autoscaling can reach.

👉 Best Practices:

  • ETL / Batch Jobs:
    • Min: 2–3 nodes (enough to handle small runs).
    • Max: Scale high (20+), since big shuffles need parallelism.
  • ML Training:
    • Min: 1–2 nodes (avoid idle GPU cost).
    • Max: Keep it small (4–8); scaling out mid-training is rarely efficient.
  • SQL / Dashboards:
    • Min: Enough to serve baseline concurrency (3–5).
    • Max: Roughly 2–3x min, since queries benefit from predictable performance.

💡 Rule of Thumb:
Set min close to your baseline demand and max to cover peak bursts, not infinity.
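As a sketch, the ETL profile above maps onto a Databricks cluster definition like this (field names follow the Clusters API 2.x schema; the cluster name, runtime version, and node type are illustrative placeholders — swap in values for your cloud and workspace):

```python
# Hypothetical Clusters API payload for the ETL/batch profile above.
# All specific values (name, runtime, node type) are illustrative.
etl_cluster = {
    "cluster_name": "etl-nightly",        # illustrative name
    "spark_version": "14.3.x-scala2.12",  # pick a supported LTS runtime
    "node_type_id": "i3.xlarge",          # example AWS instance type
    "autoscale": {
        "min_workers": 2,   # min near baseline demand: handles small runs
        "max_workers": 20,  # max covers peak bursts, not infinity
    },
}
```

The key point is the `autoscale` block: supplying `min_workers`/`max_workers` instead of a fixed `num_workers` is what turns autoscaling on.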


Step 2: Termination Settings – Kill Idle Clusters

Autoscaling is only half the battle. The other half is cluster termination.

  • Idle Timeout: The cluster shuts down after X minutes of inactivity.

👉 Guidelines:

  • Job Clusters: Always set short (10–15 mins). They spin up on demand anyway.
  • All-Purpose Clusters: 30–60 mins depending on how often teams use them.

⚠️ Mistake to Avoid: Leaving clusters running “just in case.” That’s how bills double.
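In the same Clusters API schema, idle timeout is a single field, `autotermination_minutes`. A minimal sketch of the two guidelines above (values are the recommended ranges, not requirements):

```python
# Idle-timeout sketch: job clusters short, all-purpose clusters longer.
job_cluster = {"autotermination_minutes": 15}   # job cluster: 10–15 min
all_purpose = {"autotermination_minutes": 45}   # all-purpose: 30–60 min
```

Setting this field to 0 disables auto-termination entirely — the "just in case" trap described above.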


Step 3: Spot/Preemptible Instances – The Cost Hack

Most ETL workloads can tolerate a few node losses. That’s where spot/preemptible VMs shine:

  • 60–70% cheaper than on-demand.
  • Perfect for batch jobs with retries.

👉 Mixing Strategy:

  • ETL Jobs: Use 50–70% spot, 30–50% on-demand for stability.
  • Critical SQL/ML Jobs: Avoid spot for high-priority clusters.

💡 Use Instance Pools to pre-warm both spot and on-demand VMs for faster startup.
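On AWS, the mixing strategy above is expressed through `aws_attributes` (Azure and GCP use `azure_attributes` / `gcp_attributes` with analogous fields). A sketch for a 10-node ETL cluster with roughly a 40/60 on-demand/spot split — the numbers are illustrative, not prescriptive:

```python
# Spot-mix sketch: the first N workers run on-demand for stability,
# the remainder run as spot with automatic on-demand fallback.
spot_mix = {
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "aws_attributes": {
        "first_on_demand": 4,                  # first 4 nodes on-demand (~40%)
        "availability": "SPOT_WITH_FALLBACK",  # rest spot; fall back if reclaimed
    },
}
```

Keeping the driver and the first few workers on-demand is what protects the job when spot capacity is reclaimed mid-run.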


Step 4: Avoid “Yo-Yo” Scaling

Yo-yo scaling = when clusters constantly add/remove nodes, wasting time on repeated shuffles.

👉 How to Prevent It:

  1. Set a Reasonable Min Size: With min=1, every run starts undersized and spends its first minutes scaling up and reshuffling.
  2. Wider Scaling Range: Instead of min=2, max=3, give room (e.g., min=2, max=10).
  3. Optimize Data Layout: Small files, skew, and unpartitioned tables force excessive scaling.
  4. Use AQE (Adaptive Query Execution): Spark 3.x adjusts joins/shuffles on the fly, reducing unnecessary scale-out.
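For point 4, these are the Spark 3.x SQL conf keys involved (AQE is enabled by default on recent Databricks runtimes; the sketch shows them explicitly for clusters where you want to pin the behavior):

```python
# AQE settings that reduce unnecessary scale-out by fixing shuffle
# problems at the query level instead of the cluster level.
aqe_confs = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge tiny shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed partitions
}
```

These can be set per cluster under `spark_conf` in the cluster definition, or per session via `spark.conf.set(...)`.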

Step 5: Monitoring & Feedback Loop

Autoscaling is not “set and forget.” You need to monitor:

  • CPU Utilization: 60–80% = healthy. <40% = over-provisioned.
  • Memory Utilization: >90% = scale up node size.
  • Shuffle Metrics: Too high = repartition data.
  • Cluster Events Logs: Shows when and why scaling events happen.

👉 Build a Cost vs. Performance Dashboard using:

  • system.compute.node_timeline for CPU/memory usage.
  • observability.v_job_runs for job duration.
  • system.access.audit for user-level usage.
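As a starting point for the CPU/memory panel, here is a sketch of a query over `system.compute.node_timeline` (column names per the Databricks system-tables docs; verify against your workspace version). It flags clusters averaging under the 40% CPU threshold from the guideline above:

```python
# Weekly utilization query: surfaces over-provisioned clusters
# (avg CPU < 40% per the healthy-range guideline above).
utilization_sql = """
SELECT cluster_id,
       AVG(cpu_user_percent + cpu_system_percent) AS avg_cpu_pct,
       AVG(mem_used_percent)                      AS avg_mem_pct
FROM system.compute.node_timeline
WHERE start_time >= current_date() - INTERVAL 7 DAYS
GROUP BY cluster_id
HAVING avg_cpu_pct < 40
"""
```

Run it with `spark.sql(utilization_sql)` in a notebook, or schedule it as a SQL alert so the feedback loop runs without you.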

Step 6: Decision Matrix – Quick Reference

| Workload | Min Nodes | Max Nodes | Idle Timeout | Spot Usage | Notes |
|---|---|---|---|---|---|
| ETL | 2–3 | 20+ | 10–15 min | 50–70% | Big parallelism, retry-friendly |
| ML Training | 1–2 | 4–8 | 15–30 min | 0–10% | Use GPUs, prefer stability |
| SQL / Dashboards | 3–5 | 6–15 | 30–60 min | 0% | Prioritize concurrency & responsiveness |

Closing Thoughts

Autoscaling is a double-edged sword:

  • Configured well: It saves thousands in cloud spend.
  • Configured poorly: It causes unstable jobs and inflated bills.

By tuning min/max nodes, enforcing termination, smartly mixing spot instances, and avoiding yo-yo scaling, you take control of performance and cost.


Action Plan for You:

  1. Review your top 3 clusters today—are min/max nodes realistic?
  2. Set termination timeouts if not already configured.
  3. Test spot instance mix on one ETL pipeline.
  4. Monitor scaling events weekly and adjust policies.
