Mohammad Gufran Jahangir · September 28, 2025

How to pick nodes, cores, memory, and disk for ETL vs. ML vs. SQL—and when to scale up vs. out


Why Cluster Right-Sizing Matters

Databricks gives us the power of Spark in the cloud—but power comes with a price.

  • Oversized clusters = wasted money.
  • Undersized clusters = failed jobs and user frustration.

Right-sizing clusters is the art of balancing performance, stability, and cost efficiency. Think of it like choosing the right vehicle:

  • Don’t rent a bus for two passengers.
  • Don’t take a scooter for a mountain trek.

This blog will give you a step-by-step approach to choosing the right nodes, cores, memory, and disk for your workloads, and help you decide when to scale up vs. scale out.


Step 1: Know Your Workload Type

Clusters aren’t “one size fits all.” Your use case determines your cluster shape.

🔹 1. ETL / Data Engineering

  • Nature: Long-running batch jobs, heavy I/O, lots of shuffles (joins, groupBy).
  • Needs: High memory + strong I/O throughput.
  • Best Fit:
    • Worker Type: Compute-optimized or balanced VMs.
    • Cores: More cores for parallelism.
    • Memory: 4–8 GB per core is a sweet spot.
    • Disk: Large local SSDs for shuffle-heavy workloads.
  • Pro Tip: Use Autoscaling clusters with a small min and larger max node count—Spark scales better horizontally for ETL.
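The ETL shape above can be sketched as a Clusters API payload. This is a minimal illustration, not a prescription: the node type, runtime version, and worker limits are assumptions you should replace with your cloud's instance types and your own budget.

```python
# Sketch of a Databricks Clusters API payload for an autoscaling ETL cluster.
# Node type ID, runtime version, and worker counts are illustrative assumptions.
etl_cluster = {
    "cluster_name": "etl-nightly",
    "spark_version": "15.4.x-scala2.12",   # assumed LTS runtime
    "node_type_id": "c5d.4xlarge",         # compute-optimized with local SSD (AWS example)
    "autoscale": {"min_workers": 2, "max_workers": 10},  # small min, larger max
    "spark_conf": {
        # Let adaptive query execution coalesce shuffle partitions at runtime
        # instead of hand-tuning spark.sql.shuffle.partitions.
        "spark.sql.adaptive.enabled": "true",
    },
}
```

The `c5d` family is one example of a compute-optimized type with attached NVMe for shuffle spill; the equivalent choice differs per cloud.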

🔹 2. ML / Data Science

  • Nature: Model training, experimentation, feature engineering.
  • Needs: GPU/accelerator if training deep learning; otherwise balanced CPU + memory.
  • Best Fit:
    • Worker Type: GPU-enabled for DL, memory-optimized for classical ML.
    • Cores: Fewer cores but more RAM per node.
    • Memory: >16 GB per executor recommended.
    • Disk: Moderate, unless working with huge local feature sets.
  • Pro Tip: Use Job Clusters for training runs to avoid idle cost.
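The job-cluster tip can be sketched as a Jobs API payload: the `new_cluster` block means compute is created for the run and torn down when it finishes, so there is no idle cost. The notebook path, node type, and runtime version here are hypothetical.

```python
# Sketch of a Jobs API payload that trains on an ephemeral job cluster.
# Notebook path, GPU runtime version, and node type are illustrative assumptions.
training_job = {
    "name": "train-churn-model",
    "tasks": [{
        "task_key": "train",
        "notebook_task": {"notebook_path": "/ML/train_churn"},  # hypothetical path
        "new_cluster": {                       # exists only for this run
            "spark_version": "15.4.x-gpu-ml-scala2.12",  # assumed GPU ML runtime
            "node_type_id": "g5.2xlarge",      # GPU-enabled (AWS example)
            "num_workers": 1,                  # scale up, not out, for training
        },
    }],
}
```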

🔹 3. SQL / BI Workloads

  • Nature: Interactive dashboards, ad-hoc queries, concurrency-sensitive.
  • Needs: Fast response, high concurrency, cache efficiency.
  • Best Fit:
    • Worker Type: Memory-optimized nodes (Photon runtime recommended).
    • Cores: Higher core count for concurrent queries.
    • Memory: 8–16 GB per core.
    • Disk: Less critical (queries hit Delta storage, not shuffle-heavy).
  • Pro Tip: Use DBSQL Serverless for unpredictable query bursts; let Databricks auto-scale for you.
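For BI, the knobs look different: SQL warehouses size by t-shirt size and scale out by adding clusters for concurrency. A sketch of a Warehouses API payload, with sizes and limits as assumptions:

```python
# Sketch of a Databricks SQL Warehouses API payload for a BI endpoint.
# The specific size, cluster limits, and auto-stop value are illustrative assumptions.
bi_warehouse = {
    "name": "bi-dashboards",
    "cluster_size": "Small",            # t-shirt size, not explicit node counts
    "min_num_clusters": 1,              # scale out for concurrent queries
    "max_num_clusters": 4,
    "enable_serverless_compute": True,  # absorb unpredictable query bursts
    "auto_stop_mins": 10,               # stop when idle to save cost
}
```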

Step 2: Scale Up vs. Scale Out

Now the million-dollar question: Should I use bigger nodes or more nodes?

  • Scale Up (Bigger Nodes):
    ✅ Good for ML (large models, big memory footprint).
    ✅ Reduces shuffle overhead since fewer nodes exchange data.
    ❌ More risk—if one node dies, more workload is lost.
  • Scale Out (More Nodes):
    ✅ Good for ETL/SQL (parallelism, high throughput).
    ✅ Resilient—failure of 1 node is absorbed.
    ❌ Can hit shuffle bottlenecks if not tuned.

👉 Rule of Thumb:

  • ETL / SQL = Scale Out
  • ML / DL = Scale Up
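The trade-off behind this rule of thumb can be made concrete with simple arithmetic. Two layouts with identical total memory behave very differently: more nodes means more node-to-node shuffle channels, but a smaller share of the cluster lost when one node dies. This is pure illustration, not a Databricks API.

```python
# Same total capacity, different shapes: 8 x 16 GB (scale out) vs 2 x 64 GB (scale up).

def shuffle_channels(nodes: int) -> int:
    """Directed node pairs that may exchange shuffle data: n * (n - 1)."""
    return nodes * (nodes - 1)

def blast_radius(nodes: int) -> float:
    """Fraction of cluster capacity lost if a single node fails."""
    return 1 / nodes

scale_out = {"nodes": 8, "gb_per_node": 16}
scale_up = {"nodes": 2, "gb_per_node": 64}

for layout in (scale_out, scale_up):
    layout["total_gb"] = layout["nodes"] * layout["gb_per_node"]      # both 128 GB
    layout["channels"] = shuffle_channels(layout["nodes"])
    layout["lost_on_failure"] = blast_radius(layout["nodes"])
```

With equal capacity, the scale-out layout has 56 shuffle channels but loses only 1/8 of the cluster per failure; the scale-up layout has just 2 channels but loses half the cluster if a node dies. That asymmetry is why shuffle-heavy ETL favors out and memory-hungry training favors up.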

Step 3: Cluster Sizing Framework

Here’s a 3-step formula you can apply today:

  1. Baseline:
    • Start with 1 driver + 2 workers.
    • Worker size: 4 cores, 16 GB RAM (ETL/SQL) or 8 cores, 64 GB RAM (ML).
  2. Measure:
    • Monitor CPU % (keep it 60–80%).
    • Monitor memory (avoid >90% usage, watch for OOM).
    • Look at shuffle read/write (if high, increase disk or repartition data).
  3. Adjust:
    • If CPU is low but jobs are slow → need more nodes (scale out).
    • If CPU is pegged but memory is fine → need bigger nodes (scale up).
    • If memory is maxed out → memory-optimized node type.
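The three "Adjust" rules reduce to a tiny decision function. The 60–80% CPU and 90% memory thresholds come from the "Measure" step above; everything else is a simplification for illustration.

```python
# The Step 3 "Adjust" heuristics as code.  Thresholds are the ones named in
# the "Measure" step; treat them as starting points, not hard limits.

def sizing_advice(cpu_pct: float, mem_pct: float, job_slow: bool) -> str:
    if mem_pct > 90:
        return "switch to a memory-optimized node type"
    if cpu_pct > 80:
        return "scale up: bigger nodes"
    if cpu_pct < 60 and job_slow:
        return "scale out: add nodes"
    return "cluster looks right-sized; keep monitoring"
```

For example, a cluster idling at 50% CPU while jobs miss their SLA gets "scale out", while one pegged at 95% CPU with healthy memory gets "scale up".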

Step 4: Admin Best Practices

  • Instance Pools: Reduce cluster start time by pre-warming nodes.
  • Cluster Policies: Prevent users from creating “monster clusters” that blow budgets.
  • Photon Runtime: Enable for SQL and ETL workloads; it can cut query times by 2–3x.
  • Spot VMs: Use for non-critical ETL jobs to save 60–70%.
  • Delta OPTIMIZE: Keep file sizes healthy to reduce shuffle burden.
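The cluster-policy guardrail above can be sketched as a policy definition. The rule syntax (`range`/`allowlist`/`fixed`) follows Databricks cluster policies, but the specific limits and instance types are assumptions to adapt to your workspace.

```python
# Sketch of a cluster policy definition that blocks "monster clusters".
# Limits and approved node types below are illustrative assumptions.
budget_guardrail_policy = {
    "autoscale.max_workers": {"type": "range", "maxValue": 10},   # cap cluster width
    "node_type_id": {
        "type": "allowlist",
        "values": ["c5d.2xlarge", "c5d.4xlarge"],  # assumed approved types
    },
    "autotermination_minutes": {"type": "fixed", "value": 30},    # kill idle clusters
}
```

Attaching a policy like this to non-admin users means they can still self-serve clusters, just not ones that blow the budget.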

Step 5: Decision Matrix (Quick Reference)

| Workload | Scale Strategy | Node Type | CPU/Memory Ratio | Disk Needs | Notes |
| --- | --- | --- | --- | --- | --- |
| ETL / Batch | Scale Out | Compute-optimized | 4–8 GB per core | High (shuffle) | Use autoscaling, Photon |
| ML Training | Scale Up | Memory/GPU-optimized | 16+ GB per core | Medium | Job clusters, GPUs |
| SQL / BI | Scale Out | Memory-optimized | 8–16 GB per core | Low | DBSQL Serverless, cache |

Closing Thoughts

Right-sizing clusters in Databricks is about knowing your workload and monitoring your metrics. Don’t guess—measure. Start small, observe, and adjust.

With the right sizing approach:

  • Your ETL jobs finish on time.
  • Your ML models train without crashing.
  • Your SQL dashboards stay snappy.
  • Your finance team thanks you for the lower bill.

✨ Next Steps for You:

  • Audit your top 5 jobs today—are they right-sized?
  • Build a cluster policy to enforce guardrails.
  • Set up a dashboard tracking CPU %, memory, and shuffle to continuously tune.
