
Fixing “CLUSTER002 – Cluster Termination Failed” in Databricks


Introduction

The CLUSTER002 – Cluster Termination Failed error in Databricks occurs when a cluster fails to shut down due to misconfigurations, resource constraints, or cloud provider-related issues.

🚨 Common symptoms of CLUSTER002 errors:

  • Cluster remains in a terminating state indefinitely.
  • “CLUSTER002 – Cluster termination failed” appears in logs.
  • The cluster cannot be restarted or deleted.
  • Dependent jobs or libraries fail due to an unresponsive cluster.

This guide explains common causes, troubleshooting steps, and solutions to fix the CLUSTER002 error and properly terminate your Databricks cluster.


1. Check if Any Active Jobs Are Preventing Termination

Symptoms:

  • The cluster does not shut down even after multiple attempts.
  • Jobs are still running on the cluster, causing termination failures.

Causes:

  • Scheduled or stuck jobs are preventing cluster shutdown.
  • Interactive notebooks remain attached, keeping the cluster active.

Fix:

Check for active job runs on the cluster:

databricks runs list --active-only

Manually stop any running jobs:

databricks runs cancel --run-id <run-id>

Detach any attached notebooks before shutting down the cluster. The Databricks CLI does not provide a detach command; use the notebook UI (the cluster dropdown → Detach) or the cluster's Notebooks tab.

Terminate the cluster after stopping jobs:

databricks clusters delete --cluster-id <cluster-id>
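
If this happens often, the shutdown can be scripted. Below is a minimal Python sketch against the Jobs and Clusters REST APIs that cancels every active run on the cluster and then requests a normal termination; the host, token, and cluster ID values are placeholders you must fill in.

import requests

HOST = "https://<databricks-instance>"  # placeholder: your workspace URL
TOKEN = "<DATABRICKS_TOKEN>"            # placeholder: a personal access token
CLUSTER_ID = "<cluster-id>"             # placeholder: the cluster to shut down
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# List currently active job runs (Jobs API 2.1; first page only --
# paginate with has_more/page_token if you have many runs).
runs = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers=HEADERS,
    params={"active_only": "true"},
).json().get("runs", [])

# Cancel every active run that reports our cluster as its execution target.
for run in runs:
    if run.get("cluster_instance", {}).get("cluster_id") == CLUSTER_ID:
        requests.post(
            f"{HOST}/api/2.1/jobs/runs/cancel",
            headers=HEADERS,
            json={"run_id": run["run_id"]},
        )

# With no active runs left, request a normal termination.
requests.post(
    f"{HOST}/api/2.0/clusters/delete",
    headers=HEADERS,
    json={"cluster_id": CLUSTER_ID},
)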

2. Verify If the Cluster Is Stuck in a Terminating State

Symptoms:

  • The cluster remains in “Terminating” state for an extended time.
  • Cannot manually restart or delete the cluster.

Causes:

  • The Databricks control plane lost communication with the cluster.
  • Underlying cloud provider (AWS, Azure, GCP) is not responding to shutdown requests.

Fix:

Check cluster status using Databricks CLI:

databricks clusters get --cluster-id <cluster-id>

Manually force cluster termination:

databricks clusters permanent-delete --cluster-id <cluster-id>

Escalate if the cluster remains stuck:

  • If the cluster is still stuck after a permanent-delete request, have a workspace admin raise a Databricks support ticket so the control plane can clear the stuck state.

Check Cloud Provider Logs for Termination Issues:

  • AWS: Verify EC2 instances in AWS Console → EC2 → Instances.
  • Azure: Check cluster status in Azure Portal → Databricks.
  • GCP: Check VM status in GCP Console → Compute Engine.
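
The same check-then-escalate flow can be scripted. The Python sketch below polls the cluster state via the Clusters API and falls back to permanent-delete only if the cluster is still not TERMINATED after a timeout; host, token, cluster ID, and the 15-minute deadline are placeholder choices.

import time
import requests

HOST = "https://<databricks-instance>"  # placeholder
TOKEN = "<DATABRICKS_TOKEN>"            # placeholder
CLUSTER_ID = "<cluster-id>"             # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

deadline = time.time() + 15 * 60  # give normal termination 15 minutes
while time.time() < deadline:
    state = requests.get(
        f"{HOST}/api/2.0/clusters/get",
        headers=HEADERS,
        params={"cluster_id": CLUSTER_ID},
    ).json().get("state")
    print("cluster state:", state)
    if state == "TERMINATED":
        break
    time.sleep(30)
else:
    # Still not terminated after the deadline: remove it permanently.
    # Warning: permanent-delete is irreversible -- the cluster config is lost.
    requests.post(
        f"{HOST}/api/2.0/clusters/permanent-delete",
        headers=HEADERS,
        json={"cluster_id": CLUSTER_ID},
    )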

3. Check If Auto-Termination Settings Are Incorrect

Symptoms:

  • The cluster does not auto-terminate after its configured idle timeout.
  • The auto-termination timeout is set too high, leaving idle clusters running for long periods.

Causes:

  • Incorrect auto-termination settings prevent shutdown.
  • Idle cluster settings are misconfigured.

Fix:

Verify the cluster’s auto-termination settings:

  1. Go to Databricks UI → Clusters → Edit Cluster
  2. Check the Auto-Termination Timeout and set it to a lower value (e.g., 30 minutes).
  3. Save the settings and restart the cluster.

Update the auto-termination setting via the CLI. Note that the Clusters Edit API replaces the entire cluster specification, so include the required fields (at minimum the Spark version, node type, and worker count), not just the changed value:

databricks clusters edit --json '{
  "cluster_id": "<cluster-id>",
  "spark_version": "<spark-version>",
  "node_type_id": "<node-type-id>",
  "num_workers": <num-workers>,
  "autotermination_minutes": 30
}'
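
Because the Edit endpoint replaces the whole specification, a safer pattern is to read the current spec, change only autotermination_minutes, and resubmit. A minimal Python sketch (the field list is an illustrative subset; placeholders as before):

import requests

HOST = "https://<databricks-instance>"  # placeholder
TOKEN = "<DATABRICKS_TOKEN>"            # placeholder
CLUSTER_ID = "<cluster-id>"             # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current cluster definition.
spec = requests.get(
    f"{HOST}/api/2.0/clusters/get",
    headers=HEADERS,
    params={"cluster_id": CLUSTER_ID},
).json()

# Keep only writable fields (illustrative subset; clusters/get also returns
# read-only fields such as state that the edit endpoint does not accept).
edit_body = {k: spec[k] for k in (
    "cluster_id", "cluster_name", "spark_version",
    "node_type_id", "num_workers", "autoscale",
) if k in spec}
edit_body["autotermination_minutes"] = 30  # the actual change

requests.post(f"{HOST}/api/2.0/clusters/edit", headers=HEADERS, json=edit_body)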

If auto-termination is not working, manually terminate the cluster:

databricks clusters delete --cluster-id <cluster-id>

4. Fix Termination Failures Due to Cloud Provider Errors

Symptoms:

  • Error: “Cluster termination failed due to cloud provider error.”
  • Databricks logs show AWS, Azure, or GCP-related errors.

Causes:

  • Cloud resources (EC2 instances, VMs, storage) are not responding to termination requests.
  • IAM permissions prevent Databricks from shutting down cloud instances.

Fix:

Check Cloud Provider Logs:

  • AWS: Check EC2 instance status in AWS Console → EC2 → Instances.
  • Azure: Check Azure VM logs in Azure Portal → Virtual Machines.
  • GCP: Check Compute Engine logs in GCP Console → Compute Engine.

Force termination of stuck cloud resources manually:

  • AWS: aws ec2 terminate-instances --instance-ids <instance-id>
  • Azure: az vm delete --resource-group <resource-group> --name <vm-name>
  • GCP: gcloud compute instances delete <instance-name> --zone <zone>

Check IAM permissions for terminating cloud resources:

  • Ensure Databricks has permission to terminate instances.
  • AWS IAM Policy Example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:TerminateInstances", "s3:DeleteObject"],
      "Resource": "*"
    }
  ]
}
  • Azure Role Assignment Example:
az role assignment create --assignee <databricks-service-principal> --role "Virtual Machine Contributor" --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
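
To verify the AWS permissions without trial and error, you can simulate the role's policy with boto3. This is a sketch; the role ARN is a placeholder for whatever cross-account role your Databricks deployment uses:

import boto3

iam = boto3.client("iam")

# Ask IAM whether the Databricks cross-account role may terminate instances.
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::<account-id>:role/<databricks-role>",  # placeholder
    ActionNames=["ec2:TerminateInstances"],
)

for evaluation in result["EvaluationResults"]:
    # Any decision other than "allowed" explains a cloud-side termination failure.
    print(evaluation["EvalActionName"], "->", evaluation["EvalDecision"])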

5. Resolve Cluster Termination Failures Due to Mounted Storage

Symptoms:

  • Error: “Cluster termination failed due to active DBFS mounts.”
  • Cluster remains active because mounted storage is in use.

Causes:

  • DBFS mounts (S3, ADLS, GCS) are preventing shutdown.
  • Databricks cannot unmount active storage during termination.

Fix:

List active DBFS mounts:

dbutils.fs.mounts()

Unmount all mounted storage before terminating the cluster:

dbutils.fs.unmount("/mnt/my-mount")
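
If several mounts are involved, a short notebook cell can unmount everything except the built-in DBFS paths. A sketch, with an illustrative skip list you should adjust for your environment:

# Run inside a Databricks notebook, where dbutils is available.
# The skip list below is an assumption -- keep any mounts your jobs still need.
SKIP = ("/", "/databricks-datasets", "/databricks-results", "/databricks/mlflow-tracking")

for mount in dbutils.fs.mounts():
    if mount.mountPoint not in SKIP:
        print("unmounting", mount.mountPoint)
        dbutils.fs.unmount(mount.mountPoint)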

Retry terminating the cluster after unmounting storage:

databricks clusters delete --cluster-id <cluster-id>

6. Force Termination Using Databricks REST API

If the cluster is completely unresponsive, use Databricks REST API to force termination.

List running clusters:

curl -X GET -H "Authorization: Bearer <DATABRICKS_TOKEN>" \
"https://<databricks-instance>/api/2.0/clusters/list"

Force delete the stuck cluster:

curl -X POST -H "Authorization: Bearer <DATABRICKS_TOKEN>" \
"https://<databricks-instance>/api/2.0/clusters/permanent-delete" \
-d '{"cluster_id": "<cluster-id>"}'

Step-by-Step Troubleshooting Summary

1️⃣ Check for Active Jobs Preventing Termination

databricks runs list --active-only
databricks runs cancel --run-id <run-id>

2️⃣ Verify If Cluster Is Stuck in Terminating State

databricks clusters get --cluster-id <cluster-id>
databricks clusters permanent-delete --cluster-id <cluster-id>

3️⃣ Adjust Auto-Termination Settings

  • Set Auto-Termination Timeout to a lower value (e.g., 30 minutes).

4️⃣ Check Cloud Provider Logs for Errors

  • AWS: Terminate stuck EC2 instances.
  • Azure: Check VM status and permissions.
  • GCP: Verify Compute Engine instance status.

5️⃣ Unmount Active Storage Before Termination

dbutils.fs.unmount("/mnt/my-mount")

6️⃣ Force Termination Using REST API

curl -X POST -H "Authorization: Bearer <DATABRICKS_TOKEN>" \
"https://<databricks-instance>/api/2.0/clusters/permanent-delete" \
-d '{"cluster_id": "<cluster-id>"}'

Best Practices to Prevent CLUSTER002 Errors

Enable Auto-Termination for Clusters

  • Prevent long-running idle clusters by setting termination timeouts.

Avoid Keeping Jobs or Notebooks Attached Indefinitely

  • Detach notebooks before terminating clusters.

Use Proper IAM Permissions for Cloud Provider Integration

  • Ensure Databricks has termination access for AWS, Azure, or GCP resources.

Monitor Cluster Logs and Termination Requests

  • Regularly review cluster logs for failed termination attempts.

Conclusion

If your Databricks cluster fails to terminate with CLUSTER002 errors, check:
  • Active jobs preventing shutdown.
  • Stuck cluster states in the control plane.
  • Cloud provider issues (AWS, Azure, GCP).
  • DBFS mounts that may block termination.

By following this guide, you can successfully force terminate a Databricks cluster and prevent future termination failures.
