CLUSTER004 – Worker Node Failure (Spot Instance Terminated) in Databricks

Introduction

Databricks leverages Spot Instances (AWS), Spot VMs (Azure), and Preemptible VMs (GCP) to reduce costs, but these instances can be reclaimed by the cloud provider with little or no warning. Worker node failures (CLUSTER004 errors) disrupt running jobs and cause lost executors, shuffle failures, lost intermediate data, and performance slowdowns.

🚨 Common issues caused by worker node failures (Spot instance terminations):

  • Jobs fail intermittently with CLUSTER004 errors.
  • Performance degradation due to frequent worker node loss.
  • Shuffle failures or lost executors leading to incomplete computations.
  • Long job execution times due to automatic instance replacement.

This guide explores troubleshooting steps and best practices to prevent worker node failures and ensure stable job execution.


1. Understanding CLUSTER004 – Spot Instance Termination

What Causes Spot Instance Terminations and the Resulting Failures?

  1. Cloud provider reclaims Spot instances when demand is high.
  2. Insufficient capacity in the selected availability zone.
  3. Worker nodes not replaced fast enough, leading to job failures.
  4. Cluster autoscaling issues preventing new instances from being provisioned.
  5. Long-running jobs using spot instances risk losing progress when interrupted.

💡 By default, Databricks tries to replace terminated Spot instances, but frequent terminations impact stability and performance.


2. Diagnosing Spot Instance Termination Issues

Check Recent Cluster Events

  • Go to Databricks UI → Clusters → Event Log.
  • Look for events like:
    • “Worker instance terminated due to spot instance reclaim.”
    • “Worker node failure detected, attempting replacement.”
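The same events can also be pulled programmatically from the Clusters API. A minimal sketch, assuming a workspace URL, a personal access token, and a cluster ID (all placeholders below):

import requests

# Placeholders – substitute your own workspace URL, token, and cluster ID.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
CLUSTER_ID = "<cluster-id>"

# POST /api/2.0/clusters/events returns the cluster event log;
# NODES_LOST and TERMINATING events point at Spot reclaims and worker loss.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": CLUSTER_ID,
        "event_types": ["NODES_LOST", "TERMINATING"],
        "limit": 50,
    },
)
resp.raise_for_status()

for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"], event.get("details", {}))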

Check Cloud Provider Spot Instance Status

  • AWS: Monitor Spot instance interruptions (a boto3 sketch follows this list):
aws ec2 describe-spot-instance-requests
  • Azure: Check Spot VM SKU availability in the target region:
az vm list-skus --location eastus --query "[?name=='Standard_D8s_v3']"
  • GCP: List terminated (preempted) instances:
gcloud compute instances list --filter="status=TERMINATED"
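For scripting the AWS check, a boto3 sketch (assumes AWS credentials and region are already configured; the status codes in the comment are examples):

import boto3

ec2 = boto3.client("ec2")

# Each Spot request carries a status code; values such as
# "marked-for-termination" or "instance-terminated-by-price" indicate reclaims.
response = ec2.describe_spot_instance_requests()
for req in response["SpotInstanceRequests"]:
    print(
        req["SpotInstanceRequestId"],
        req["State"],
        req["Status"]["Code"],
        req.get("InstanceId", "-"),
    )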

Monitor Cluster Autoscaling and Worker Node Replacement

  • Go to Databricks UI → Clusters → Scaling Events.
  • If replacement is slow or fails, check your cloud account's instance quotas and Spot capacity for the selected instance type.

3. Solutions to Prevent Spot Instance Failures in Databricks

Option 1: Use On-Demand Instances Instead of Spot Instances

Symptoms:

  • Spot instances terminate frequently, impacting stability.
  • Long-running jobs fail unexpectedly due to worker loss.

Fix:
Force Databricks to use On-Demand Instances instead of Spot:

  • Go to Databricks UI → Cluster Settings → Worker Type.
  • Uncheck “Use Spot Instances” to switch to On-Demand instances.

For Automated Cluster Configuration:

{
  "aws_attributes": {
    "availability": "ON_DEMAND"
  }
}

For Azure: Set priority to “Regular” instead of “Spot” in VM options.
For GCP: Disable Preemptible Instances in cluster settings.
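To apply the same On-Demand setting from a script rather than the UI (AWS shown), a minimal sketch against the Clusters REST API; the workspace URL, token, and cluster values are placeholders, and /api/2.0/clusters/edit expects the full cluster spec, so resend your existing settings:

import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                   # placeholder

# clusters/edit replaces the whole spec, so keep the cluster's current
# spark_version, node_type_id, and sizing alongside the new aws_attributes.
cluster_spec = {
    "cluster_id": "<cluster-id>",
    "cluster_name": "<existing-cluster-name>",
    "spark_version": "<existing-spark-version>",
    "node_type_id": "<existing-node-type>",
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "aws_attributes": {"availability": "ON_DEMAND"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()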


Option 2: Use Mixed Instance Types (Spot + On-Demand Workers)

Symptoms:

  • Frequent Spot terminations, but costs need to remain low.
  • Cluster stability is affected, but full On-Demand usage is too expensive.

Fix:
Enable a mix of On-Demand and Spot instances:

{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK"
  }
}

This allows Databricks to fall back to On-Demand instances when Spot capacity is unavailable; on AWS it can also be combined with first_on_demand, as shown in the sketch below.
Azure: Use Spot VMs with fallback to regular (pay-as-you-go) VMs.
GCP: Enable fallback from Preemptible VMs to regular VMs in the instance settings.
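On AWS, the first_on_demand attribute keeps the driver and the first few workers on On-Demand capacity while the remaining workers run on Spot. A sketch (the count of 2 is an assumption; tune it to your workload):

# Python dict mirroring the cluster JSON's aws_attributes block.
aws_attributes = {
    "availability": "SPOT_WITH_FALLBACK",
    # Driver plus the first worker stay On-Demand; remaining workers use Spot.
    "first_on_demand": 2,
}

This keeps the driver and a small core of stable workers safe from reclaims while most of the fleet stays on cheaper Spot capacity.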


Option 3: Increase Worker Node Replacement Speed

Symptoms:

  • Worker nodes take too long to be replaced, causing job failures.
  • Shuffle operations fail due to missing worker nodes.

Fix:
Increase cluster autoscaling limits to allow faster worker replacement:

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  }
}

Ensure aggressive autoscaler scale-down or cluster auto-termination settings are not removing capacity that running jobs still need.
Use smaller instance types, which are generally easier to re-acquire from Spot pools, to speed up replacement; a sketch of the related cluster fields follows.
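As a sketch of those cluster-level fields (the field names are standard cluster spec fields; the values and node type are illustrative):

# Partial cluster spec – merge into your existing configuration.
cluster_settings = {
    "autoscale": {"min_workers": 2, "max_workers": 10},
    # Give idle periods some headroom so auto-termination does not shut the
    # cluster down between commands or scheduled runs.
    "autotermination_minutes": 60,
    # Smaller node types are usually easier to re-acquire from Spot pools.
    "node_type_id": "m5.xlarge",
}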


Option 4: Spread Spot Instances Across Multiple Availability Zones

Symptoms:

  • Spot instance availability varies by region, causing failures.
  • Some availability zones have fewer Spot instances available.

Fix:
Enable automatic availability zone selection (auto-AZ) so workers are placed in whichever zone has Spot capacity:

{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK",
    "zone_id": "auto"
  }
}

AWS: Choose instance types available in multiple zones to increase reliability.
Azure: Distribute VM allocation across multiple zones.
GCP: Allow Preemptible VM capacity to be drawn from multiple zones rather than pinning the cluster to a single zone.


Option 5: Implement Fault-Tolerant Job Strategies

Symptoms:

  • Long-running jobs fail when Spot instances terminate.
  • No retry mechanism is in place for failed jobs.

Fix:
Enable Checkpointing for Streaming Jobs to Prevent Data Loss (the output path below is illustrative):

df.writeStream.format("delta").option("checkpointLocation", "/mnt/checkpoints/").start("/mnt/output/")

Raise Spark's Task-Failure Tolerance for Critical Jobs (spark.task.maxFailures is a standard Spark setting, applied here through the cluster's Spark config):

{
  "spark_conf": {
    "spark.task.maxFailures": "10"
  }
}

Note that cluster autoscaling itself is configured with the autoscale block shown in Option 3, not through a Spark property.

Use Databricks Job Retries for Resiliency:

  • Go to Databricks UI → Jobs → Retry Policy → Enable Retries
  • Configure Automatic Retries for job failures.
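If jobs are defined through the Jobs API rather than the UI, retries live on the task settings. A sketch of the relevant fields (the task name, notebook path, and values are illustrative):

# Partial task definition for the Jobs API (2.1-style task settings).
task_settings = {
    "task_key": "nightly_etl",                                 # hypothetical task name
    "notebook_task": {"notebook_path": "/Jobs/nightly_etl"},   # hypothetical path
    "max_retries": 3,                     # retry the task up to 3 times on failure
    "min_retry_interval_millis": 60000,   # wait 1 minute between attempts
    "retry_on_timeout": True,             # also retry when the task times out
}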

4. Step-by-Step Troubleshooting Guide

1. Check Recent Spot Termination Logs

  • Go to Databricks UI → Clusters → Event Log
  • Look for “Spot instance terminated” messages.

2. Verify If Cluster Has Sufficient Replacement Capacity

aws ec2 describe-spot-price-history --instance-types m5.xlarge
  • High or rapidly rising Spot prices signal constrained capacity; if Spot capacity is tight, switch to On-Demand or a different instance type.

3. Test Job Execution with On-Demand Workers Only

{
  "aws_attributes": {
    "availability": "ON_DEMAND"
  }
}
  • If the job runs fine, Spot Instances were causing instability.

4. Enable Mixed Spot + On-Demand Instances and Retry

{
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK"
  }
}
  • If job stability improves, use this configuration for cost savings.

5. Best Practices to Prevent Spot Instance Failures in Databricks

Use On-Demand Instances for Critical Workloads

  • For long-running jobs, avoid Spot instances.

Enable Mixed Spot + On-Demand Instances for Cost Savings

  • Use Spot with fallback to On-Demand instances.

Spread Spot Instances Across Multiple Availability Zones

  • Ensure Databricks clusters can provision in multiple zones.

Enable Checkpointing and Auto-Retry for Job Resilience

  • Protect long-running jobs from worker failures.

Monitor Spot Instance Capacity and Replace Workers Faster

  • Increase autoscaling limits and use smaller instance types.

6. Conclusion

Spot instance failures (CLUSTER004 – Worker Node Failure) can cause job failures, performance issues, and shuffle errors in Databricks. To prevent disruptions, use:

  • On-Demand instances for critical workloads.
  • Spot with fallback for cost savings.
  • Multi-AZ deployments to improve availability.
  • Job retries, checkpointing, and autoscaling for resiliency.

By optimizing cluster settings, monitoring spot capacity, and using fault-tolerant execution, Databricks workloads can run efficiently without interruptions.
