Introduction
Databricks can run worker nodes on discounted capacity, such as Spot Instances on AWS, Spot VMs on Azure, and Preemptible (Spot) VMs on GCP, to reduce costs, but the cloud provider can reclaim these instances with little warning when capacity is needed elsewhere. The resulting worker node failures (CLUSTER004 errors) disrupt running jobs and can lead to lost shuffle data, incomplete computations, and performance slowdowns.
🚨 Common issues caused by worker node failures (Spot instance terminations):
- Jobs fail intermittently with CLUSTER004 errors.
- Performance degradation due to frequent worker node loss.
- Shuffle failures or lost executors leading to incomplete computations.
- Long job execution times due to automatic instance replacement.
This guide explores troubleshooting steps and best practices to prevent worker node failures and ensure stable job execution.
1. Understanding CLUSTER004 – Spot Instance Termination
What Causes Spot Instance Terminations and the Resulting Job Failures?
- Cloud provider reclaims Spot instances when demand is high.
- Insufficient capacity in the selected availability zone.
- Worker nodes not replaced fast enough, leading to job failures.
- Cluster autoscaling issues preventing new instances from being provisioned.
- Long-running jobs using spot instances risk losing progress when interrupted.
💡 By default, Databricks tries to replace terminated Spot instances, but frequent terminations impact stability and performance.
2. Diagnosing Spot Instance Termination Issues
✅ Check Recent Cluster Events
- Go to Databricks UI → Clusters → Event Log.
- Look for events like:
- “Worker instance terminated due to spot instance reclaim.”
- “Worker node failure detected, attempting replacement.”
✅ Check Cloud Provider Spot Instance Status
- AWS: Monitor Spot instance interruptions using:
aws ec2 describe-spot-instance-requests
- Azure: Check Spot VM SKU availability and restrictions in your region, for example:
az vm list-skus --location eastus --query "[?name=='Standard_D8s_v3']"
- GCP: List instances that have been preempted or otherwise terminated:
gcloud compute instances list --filter="status=TERMINATED"
✅ Monitor Cluster Autoscaling and Worker Node Replacement
- Go to Databricks UI → Clusters → Scaling Events.
- If replacement is slow or fails, check instance capacity limits.
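If you prefer to audit these events programmatically rather than in the UI, the Clusters Events API can be polled for lost-node events. A minimal sketch, assuming the workspace URL and a personal access token are available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables (the cluster ID below is a placeholder):

# Poll the Databricks Clusters Events API (2.0) for recent NODES_LOST events,
# which indicate workers removed from the cluster (e.g. Spot reclaims).
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "0123-456789-abcde123",  # placeholder cluster ID
        "event_types": ["NODES_LOST"],
        "limit": 25,
    },
)
resp.raise_for_status()

for event in resp.json().get("events", []):
    print(event["timestamp"], event.get("details", {}))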
3. Solutions to Prevent Spot Instance Failures in Databricks
Option 1: Use On-Demand Instances Instead of Spot Instances
Symptoms:
- Spot instances terminate frequently, impacting stability.
- Long-running jobs fail unexpectedly due to worker loss.
Fix:
✅ Force Databricks to use On-Demand Instances instead of Spot:
- Go to Databricks UI → Cluster Settings → Worker Type.
- Uncheck “Use Spot Instances” to switch to On-Demand instances.
✅ For Automated Cluster Configuration (a REST API sketch applying this block follows at the end of this option):
{
  "aws_attributes": {
    "availability": "ON_DEMAND"
  }
}
✅ For Azure: Set priority to “Regular” instead of “Spot” in VM options.
✅ For GCP: Disable Preemptible Instances in cluster settings.
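If you manage clusters through the REST API rather than the UI, the same aws_attributes block can be applied with the Clusters API. A minimal sketch using the clusters/edit endpoint; note that this endpoint expects the full cluster specification, so the spark_version, node_type_id, and worker counts below are placeholders you would replace with your cluster's actual settings:

# Switch an existing cluster to on-demand workers via the Clusters API (2.0).
# clusters/edit replaces the whole cluster spec, so all required fields must be sent.
# Editing a running cluster restarts it; a terminated cluster picks up the change on next start.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

cluster_spec = {
    "cluster_id": "0123-456789-abcde123",      # placeholder cluster ID
    "cluster_name": "etl-cluster",             # placeholder name
    "spark_version": "15.4.x-scala2.12",       # placeholder runtime version
    "node_type_id": "m5.xlarge",               # placeholder instance type
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "aws_attributes": {"availability": "ON_DEMAND"},
}

resp = requests.post(
    f"{host}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()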
Option 2: Use Mixed Instance Types (Spot + On-Demand Workers)
Symptoms:
- Frequent Spot terminations, but costs need to remain low.
- Cluster stability is affected, but full On-Demand usage is too expensive.
Fix:
✅ Enable a mix of On-Demand and Spot instances:
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK"
  }
}
✅ This lets Databricks fall back to On-Demand instances when Spot capacity is unavailable (a tuning sketch follows this list).
✅ Azure: Enable Spot VMs with fallback so workers fall back to regular (on-demand) VMs when Spot capacity is unavailable.
✅ GCP: Enable preemptible VMs with fallback to on-demand VMs in the cluster's instance settings.
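On AWS, the Spot/On-Demand mix can be tuned further. A hedged example of the aws_attributes block as a Python dict: first_on_demand keeps the first N nodes (including the driver) on On-Demand capacity, and spot_bid_price_percent caps the Spot price as a percentage of the On-Demand price; the values shown are illustrative:

# Illustrative aws_attributes for a mixed Spot/On-Demand cluster.
# Pass this dict inside a cluster spec to the Clusters or Jobs API.
aws_attributes = {
    "availability": "SPOT_WITH_FALLBACK",  # use Spot, fall back to On-Demand
    "first_on_demand": 1,                  # keep the driver on On-Demand capacity
    "spot_bid_price_percent": 100,         # max Spot price, as % of the On-Demand price
}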
Option 3: Increase Worker Node Replacement Speed
Symptoms:
- Worker nodes take too long to be replaced, causing job failures.
- Shuffle operations fail due to missing worker nodes.
Fix:
✅ Increase cluster autoscaling limits to allow faster worker replacement:
{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  }
}
✅ Ensure autoscaling scale-down (and cluster auto-termination) is not removing nodes that active jobs still need.
✅ Prefer smaller, more widely available instance types; they are typically easier to replace quickly when capacity is tight.
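If the cluster is slow to return to its target size after losing workers, one hedged workaround is to nudge it with the Clusters API resize endpoint (this only works on a running cluster). A minimal sketch, reusing the same host/token environment variables as above; the cluster ID and worker count are placeholders:

# Force a running cluster back up to a target worker count via clusters/resize.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.0/clusters/resize",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": "0123-456789-abcde123",  # placeholder cluster ID
        "num_workers": 8,                      # desired worker count
    },
)
resp.raise_for_status()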
Option 4: Spread Spot Instances Across Multiple Availability Zones
Symptoms:
- Spot instance availability varies by region, causing failures.
- Some availability zones have fewer Spot instances available.
Fix:
✅ Enable automatic availability zone selection (Auto-AZ) so workers launch in the zone with the most available capacity (combine this with SPOT or SPOT_WITH_FALLBACK availability):
{
  "aws_attributes": {
    "zone_id": "auto"
  }
}
✅ AWS: Choose instance types that are available in multiple zones to increase reliability (a quick per-zone price check is sketched below).
✅ Azure: Distribute VM allocation across multiple zones.
✅ GCP: Allow preemptible VMs to be allocated across multiple zones within the region.
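On AWS, you can also check which availability zones currently have the lowest Spot prices (a rough proxy for spare capacity) before pinning a zone. A minimal sketch using boto3; the region and instance type are assumptions you would adjust:

# Compare recent Spot prices for one instance type across availability zones.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

resp = ec2.describe_spot_price_history(
    InstanceTypes=["m5.xlarge"],                     # assumed instance type
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
)

# Keep the most recent price seen per availability zone.
latest = {}
for record in resp["SpotPriceHistory"]:
    az = record["AvailabilityZone"]
    if az not in latest or record["Timestamp"] > latest[az]["Timestamp"]:
        latest[az] = record

for az, record in sorted(latest.items()):
    print(az, record["SpotPrice"])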
Option 5: Implement Fault-Tolerant Job Strategies
Symptoms:
- Long-running jobs fail when Spot instances terminate.
- No retry mechanism is in place for failed jobs.
Fix:
✅ Enable Checkpointing for Streaming Jobs to Prevent Data Loss:
df.writeStream.format("delta").option("checkpointLocation", "/mnt/checkpoints/").start("/mnt/output/")
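A slightly fuller sketch of the same idea: with a checkpoint location set, a streaming query that dies when workers are reclaimed can simply be restarted and will resume from the last committed batch. The source path, checkpoint path, and target path below are placeholders:

# Structured Streaming with checkpointing: safe to restart after worker loss.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` already exists

df = (
    spark.readStream
    .format("delta")
    .load("/mnt/raw/events/")                                      # placeholder source path
)

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events_job/")  # placeholder checkpoint path
    .outputMode("append")
    .start("/mnt/curated/events/")                                 # placeholder target path
)

# If the cluster loses workers and the job fails, rerunning this code with the
# same checkpointLocation resumes processing from the last committed batch.
query.awaitTermination()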
✅ Increase Spark's task retry tolerance so tasks lost with a reclaimed worker are re-attempted instead of failing the job (set in the cluster's Spark config; 8 is an illustrative value, the Spark default is 4):
{
  "spark.task.maxFailures": "8"
}
✅ Use Databricks Job Retries for Resiliency:
- In the Databricks UI, open Jobs → your job → the task's settings and configure Retries.
- Set the number of automatic retries (and a retry interval) for task failures.
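Retries can also be set in the job definition itself when jobs are created or updated through the Jobs API. A hedged sketch of the relevant task fields as a Python dict; the task name, notebook path, cluster ID, and retry values are placeholders:

# Illustrative task settings for the Databricks Jobs API (2.1) with retries enabled.
task_settings = {
    "task_key": "nightly_etl",                                  # placeholder task name
    "notebook_task": {"notebook_path": "/Repos/etl/nightly"},   # placeholder notebook path
    "existing_cluster_id": "0123-456789-abcde123",              # placeholder cluster ID
    "max_retries": 3,                    # retry the task up to 3 times on failure
    "min_retry_interval_millis": 60000,  # wait 60 seconds between attempts
    "retry_on_timeout": True,
}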
4. Step-by-Step Troubleshooting Guide
1. Check Recent Spot Termination Logs
- Go to Databricks UI → Clusters → Event Log
- Look for “Spot instance terminated” messages.
2. Verify If Cluster Has Sufficient Replacement Capacity
aws ec2 describe-spot-price-history --instance-types m5.xlarge
- If no capacity is available, switch to On-Demand or a different instance type.
3. Test Job Execution with On-Demand Workers Only
{
  "aws_attributes": {
    "availability": "ON_DEMAND"
  }
}
- If the job runs fine, Spot Instances were causing instability.
4. Enable Mixed Spot + On-Demand Instances and Retry
{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK"
  }
}
- If job stability improves, use this configuration for cost savings.
5. Best Practices to Prevent Spot Instance Failures in Databricks
✅ Use On-Demand Instances for Critical Workloads
- For long-running jobs, avoid Spot instances.
✅ Enable Mixed Spot + On-Demand Instances for Cost Savings
- Use Spot with fallback to On-Demand instances.
✅ Spread Spot Instances Across Multiple Availability Zones
- Ensure Databricks clusters can provision in multiple zones.
✅ Enable Checkpointing and Auto-Retry for Job Resilience
- Protect long-running jobs from worker failures.
✅ Monitor Spot Instance Capacity and Replace Workers Faster
- Increase autoscaling limits and use smaller instance types.
6. Conclusion
Spot instance failures (CLUSTER004 – Worker Node Failure) can cause job failures, performance issues, and shuffle errors in Databricks. To prevent disruptions, use:
✅ On-Demand instances for critical workloads.
✅ Spot with fallback for cost savings.
✅ Multi-AZ deployments to improve availability.
✅ Job retries, checkpointing, and autoscaling for resiliency.
By optimizing cluster settings, monitoring Spot capacity, and using fault-tolerant execution, Databricks workloads can keep running reliably even when individual Spot instances are reclaimed.