How to pick nodes, cores, memory, and disk for ETL vs. ML vs. SQL—and when to scale up vs. out

Why Cluster Right-Sizing Matters
Databricks gives us the power of Spark in the cloud—but power comes with a price.
- Oversized clusters = wasted money.
- Undersized clusters = failed jobs and user frustration.
The art of right-sizing clusters is balancing performance, stability, and cost efficiency. Think of it like choosing the right vehicle:
- Don’t rent a bus for two passengers.
- Don’t take a scooter for a mountain trek.
This blog will give you a step-by-step approach to choosing the right nodes, cores, memory, and disk for your workloads, and help you decide when to scale up vs. scale out.
Step 1: Know Your Workload Type
Clusters aren’t “one size fits all.” Your use case determines your cluster shape.
🔹 1. ETL / Data Engineering
- Nature: Long-running batch jobs, heavy I/O, lots of shuffles (joins, groupBy).
- Needs: High memory + strong I/O throughput.
- Best Fit:
  - Worker Type: Compute-optimized or balanced VMs.
  - Cores: More cores for parallelism.
  - Memory: 4–8 GB per core is the sweet spot.
  - Disk: Large local SSDs for shuffle-heavy workloads.
- Pro Tip: Use Autoscaling clusters with a small min and larger max node count—Spark scales better horizontally for ETL.
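To make the ETL guidance concrete, here is a minimal sketch of an autoscaling cluster spec in the shape of a Databricks Clusters API payload. The cluster name, runtime version, node type, and worker counts are all illustrative assumptions, not recommendations for your workload:

```python
# Sketch of an autoscaling ETL cluster spec (Clusters API-style payload).
# All values below are examples; substitute your own runtime and node type.
etl_cluster = {
    "cluster_name": "etl-nightly",        # hypothetical job name
    "spark_version": "14.3.x-scala2.12",  # pick your own runtime
    "node_type_id": "i3.xlarge",          # 4 cores, 30.5 GB RAM, local NVMe SSD
    "autoscale": {
        "min_workers": 2,                 # small floor keeps idle cost low
        "max_workers": 10,                # headroom for shuffle-heavy stages
    },
}

# Sanity-check the memory-per-core ratio against the 4-8 GB guideline.
cores_per_node, ram_gb = 4, 30.5
ratio = ram_gb / cores_per_node
print(f"{ratio:.1f} GB per core")  # ~7.6, inside the 4-8 GB sweet spot
```

The small `min_workers` plus generous `max_workers` is the pattern from the Pro Tip: pay for the floor, burst to the ceiling only when a shuffle-heavy stage needs it.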
🔹 2. ML / Data Science
- Nature: Model training, experimentation, feature engineering.
- Needs: GPU/accelerator if training deep learning; otherwise balanced CPU + memory.
- Best Fit:
  - Worker Type: GPU-enabled for DL, memory-optimized for classical ML.
  - Cores: Fewer cores but more RAM per node.
  - Memory: >16 GB per executor recommended.
  - Disk: Moderate, unless working with huge local feature sets.
- Pro Tip: Use Job Clusters for training runs to avoid idle cost.
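A quick back-of-envelope check for the ">16 GB per executor" guideline: start from node RAM, subtract a reserve for the OS and Spark daemons, and divide by the executors per node. The overhead and executor-count figures below are assumptions for illustration:

```python
# Rough executor-memory sizing for classical ML on a memory-optimized node.
# The overhead reserve and executors-per-node are illustrative assumptions.
node_ram_gb = 64
os_and_daemon_overhead_gb = 8   # assumed reserve for OS + Spark daemons
executors_per_node = 2          # fewer, fatter executors suit ML workloads

usable = node_ram_gb - os_and_daemon_overhead_gb
per_executor = usable / executors_per_node
print(f"{per_executor:.0f} GB per executor")  # 28 GB, above the 16 GB floor
```

If the same arithmetic lands you below 16 GB, that is the signal to pick a larger node or run fewer executors per node.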
🔹 3. SQL / BI Workloads
- Nature: Interactive dashboards, ad-hoc queries, concurrency-sensitive.
- Needs: Fast response, high concurrency, cache efficiency.
- Best Fit:
  - Worker Type: Memory-optimized nodes (Photon runtime recommended).
  - Cores: Higher core count for concurrent queries.
  - Memory: 8–16 GB per core.
  - Disk: Less critical (queries hit Delta storage, not shuffle-heavy).
- Pro Tip: Use DBSQL Serverless for unpredictable query bursts; let Databricks auto-scale for you.
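For the "higher core count for concurrency" point, a simple sizing sketch: multiply peak concurrent queries by the cores a typical query uses, plus a burst buffer. Every input here is an assumption to show the arithmetic, not a benchmark:

```python
# Back-of-envelope core count for an interactive SQL cluster.
# All three inputs are assumptions; measure your own dashboards.
concurrent_queries = 10   # peak dashboards refreshing at once
cores_per_query = 4       # typical slice for a short BI query
headroom = 1.25           # 25% buffer for query bursts

cores_needed = int(concurrent_queries * cores_per_query * headroom)
print(cores_needed)  # 50
```

Pair that core count with the 8–16 GB-per-core ratio above and the node shape falls out almost automatically.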
Step 2: Scale Up vs. Scale Out
Now the million-dollar question: Should I use bigger nodes or more nodes?
- Scale Up (Bigger Nodes):
  - ✅ Good for ML (large models, big memory footprint).
  - ✅ Reduces shuffle overhead, since fewer nodes exchange data.
  - ❌ More risk: if one node dies, more of the workload is lost.
- Scale Out (More Nodes):
  - ✅ Good for ETL/SQL (parallelism, high throughput).
  - ✅ Resilient: the failure of one node is absorbed.
  - ❌ Can hit shuffle bottlenecks if not tuned.
👉 Rule of Thumb:
- ETL / SQL = Scale Out
- ML / DL = Scale Up
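The "fewer nodes exchange data" point can be made concrete with a little arithmetic: in an all-to-all shuffle, each node exchanges data with every other node, so the number of distinct exchange paths grows roughly quadratically with cluster width. A sketch comparing two clusters with the same total core count:

```python
# Distinct node-to-node exchange paths in an all-to-all shuffle.
# Same 64 total cores either way; very different network fan-out.
def exchange_paths(n_nodes: int) -> int:
    # Each of n nodes sends to the other (n - 1) nodes.
    return n_nodes * (n_nodes - 1)

print(exchange_paths(4))   # 12 paths  (4 nodes x 16 cores each)
print(exchange_paths(16))  # 240 paths (16 nodes x 4 cores each)
```

This is why scale-up helps shuffle-bound ML stages while scale-out wins when the work is embarrassingly parallel.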
Step 3: Cluster Sizing Framework
Here’s a 3-step formula you can apply today:
- Baseline:
  - Start with 1 driver + 2 workers.
  - Worker size: 4 cores / 16 GB RAM (ETL/SQL) or 8 cores / 64 GB RAM (ML).
- Measure:
  - Monitor CPU utilization (keep it at 60–80%).
  - Monitor memory (avoid >90% usage; watch for OOM errors).
  - Look at shuffle read/write (if high, increase disk or repartition the data).
- Adjust:
  - If CPU is low but jobs are slow → add more nodes (scale out).
  - If CPU is pegged but memory is fine → use bigger nodes (scale up).
  - If memory is maxed out → switch to a memory-optimized node type.
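The adjust rules above can be written down as a tiny decision helper. The thresholds mirror the 60–80% CPU and 90% memory guidance from the Measure step; treat them as starting points to tune for your environment:

```python
# The adjust heuristics from the sizing framework as a small function.
# Thresholds (60/80/90) follow the guidance above; tune for your workloads.
def sizing_advice(cpu_pct: float, mem_pct: float, jobs_slow: bool) -> str:
    if mem_pct >= 90:
        return "switch to memory-optimized nodes"
    if cpu_pct >= 80:
        return "scale up (bigger nodes)"
    if cpu_pct < 60 and jobs_slow:
        return "scale out (more nodes)"
    return "current size looks healthy"

print(sizing_advice(cpu_pct=45, mem_pct=70, jobs_slow=True))
# -> scale out (more nodes)
```

Wiring a function like this into a weekly review of cluster metrics keeps the tuning loop honest instead of anecdotal.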
Step 4: Admin Best Practices
- Instance Pools: Reduce cluster start time by pre-warming nodes.
- Cluster Policies: Prevent users from creating “monster clusters” that blow budgets.
- Photon Runtime: Enable for SQL and ETL to cut query times 2–3x.
- Spot VMs: Use for non-critical ETL jobs to save 60–70%.
- Delta OPTIMIZE: Keep file sizes healthy to reduce shuffle burden.
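As an example of the cluster-policy guardrail, here is a minimal policy definition in the Databricks cluster-policy JSON format, expressed as a Python dict. The attribute paths and constraint types (`range`, `allowlist`, `fixed`) follow that format; the specific limits and node types are illustrative:

```python
# Sketch of a cluster policy that caps autoscaling, restricts node types,
# and forces auto-termination. Limits and node types are examples only.
import json

policy = {
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],  # example node types
    },
    "autotermination_minutes": {
        "type": "fixed",
        "value": 30,                            # kill idle clusters
    },
}
print(json.dumps(policy, indent=2))
```

A policy like this is what stops the "monster cluster" problem at creation time rather than on the monthly bill.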
Step 5: Decision Matrix (Quick Reference)
| Workload | Scale Strategy | Node Type | CPU/Memory Ratio | Disk Needs | Notes |
|---|---|---|---|---|---|
| ETL / Batch | Scale Out | Compute-optimized | 4–8 GB per core | High (shuffle) | Use autoscaling, Photon |
| ML Training | Scale Up | Memory/GPU optimized | 16+ GB per core | Medium | Job clusters, GPUs |
| SQL / BI | Scale Out | Memory-optimized | 8–16 GB per core | Low | DBSQL Serverless, cache |
Closing Thoughts
Right-sizing clusters in Databricks is about knowing your workload and monitoring your metrics. Don’t guess—measure. Start small, observe, and adjust.
With the right sizing approach:
- Your ETL jobs finish on time.
- Your ML models train without crashing.
- Your SQL dashboards stay snappy.
- Your finance team thanks you for the lower bill.
✨ Next Steps for You:
- Audit your top 5 jobs today—are they right-sized?
- Build a cluster policy to enforce guardrails.
- Set up a dashboard tracking CPU %, memory, and shuffle to continuously tune.