Cluster Creation Failed: CLUSTER001 – Capacity or Config Issue in Databricks

Introduction

If your Databricks cluster fails to start with the error “CLUSTER001 – Cluster creation failed (capacity or config issue)”, it typically indicates insufficient resources, misconfigured settings, or cloud provider limitations.

🚨 Common causes of cluster creation failures:

  • Insufficient cloud provider capacity (AWS, Azure, or GCP).
  • Invalid instance type or region constraints.
  • Misconfigured IAM roles or permissions.
  • Incorrect Databricks runtime settings.

This guide covers troubleshooting steps and solutions to successfully start your cluster.


1. Check Cloud Provider Resource Availability

Symptoms:

  • Error: “CLUSTER001: Instance type is unavailable in the selected region.”
  • Cluster fails due to insufficient compute capacity in the cloud provider.
  • AWS: EC2 instance limits exceeded.
  • Azure: VM quota reached or SKU unavailable.
  • GCP: Insufficient CPU/GPU quota.

Fix:

Reduce cluster size or switch to an available instance type:

  • AWS: aws ec2 describe-instance-type-offerings --location-type region --region us-east-1
    • Try a different instance type or family (e.g., switch from m5.xlarge to m5a.xlarge, the same size in a more available family).
  • Azure: az vm list-sizes --location eastus
    • If the VM quota is exceeded, request an increase through the Azure portal (Help + support → Quotas) or the az quota CLI extension.
  • GCP: gcloud compute regions describe us-central1
    • Switch to another region or use a different machine type.

Change cluster configuration in Databricks:

  • Go to Databricks UI → Clusters → Edit Cluster.
  • Change Instance Type to a smaller or more available machine.
  • Select a different cloud region if capacity issues persist (a scripted fallback using the Clusters API is sketched below).
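
If you hit capacity errors repeatedly, you can automate the fallback by retrying cluster creation with alternative node types through the Databricks Clusters REST API. A minimal sketch, assuming a workspace URL, a personal access token, a current LTS runtime string, and an AWS node type list (all placeholders you should adjust):

# Minimal sketch: retry cluster creation with fallback node types via the
# Databricks Clusters API. Host, token, runtime version, and node types
# are placeholder assumptions for illustration.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"                                  # assumption

# Candidate node types, ordered by preference; capacity varies by region.
NODE_TYPES = ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]

def try_create_cluster():
    for node_type in NODE_TYPES:
        resp = requests.post(
            f"{DATABRICKS_HOST}/api/2.0/clusters/create",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={
                "cluster_name": "capacity-fallback-test",
                "spark_version": "13.3.x-scala2.12",  # assumption: a current LTS runtime
                "node_type_id": node_type,
                "autoscale": {"min_workers": 2, "max_workers": 8},
            },
        )
        if resp.ok:
            print(f"Created cluster {resp.json()['cluster_id']} on {node_type}")
            return
        print(f"{node_type} failed: {resp.text}")

try_create_cluster()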

2. Verify IAM Roles and Permissions

Symptoms:

  • Cluster fails with permission-related errors.
  • Error: “Permission denied: Cannot create cluster.”

Fix:

Ensure IAM roles have required permissions for Databricks cluster creation.

AWS IAM Policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:DescribeInstances",
        "ec2:CreateTags",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    }
  ]
}
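
To apply a policy like this programmatically, you can attach it as an inline policy with boto3. A minimal sketch, assuming AWS credentials are already configured and the role is named DatabricksClusterRole (a placeholder):

# Minimal sketch: attach the policy above to an existing IAM role with boto3.
# "DatabricksClusterRole" and the policy name are placeholder assumptions.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ec2:RunInstances",
            "ec2:DescribeInstances",
            "ec2:CreateTags",
            "s3:GetObject",
            "s3:ListBucket",
        ],
        "Resource": "*",
    }],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="DatabricksClusterRole",          # assumption: your cluster-creation role
    PolicyName="databricks-cluster-create",
    PolicyDocument=json.dumps(policy),
)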

Azure Databricks Managed Identity:

az role assignment create --assignee <service-principal-id> --role "Contributor" --scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<workspace-name>

GCP IAM Policy for Databricks Service Account:

gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:databricks-sa@<project-id>.iam.gserviceaccount.com" --role="roles/dataproc.worker"

3. Validate Databricks Runtime and Cluster Configurations

Symptoms:

  • Error: “Invalid cluster configuration.”
  • Cluster fails due to an unsupported Databricks runtime version.

Fix:

Ensure the Databricks runtime is supported:

  • Go to Databricks UI → Clusters → Edit Cluster.
  • Select a stable, supported Databricks runtime (avoid preview or end-of-life versions); the sketch below shows how to list what your workspace supports.
  • For GPU clusters, choose a GPU-enabled (ML) runtime, which bundles the required CUDA drivers.
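
To see exactly which runtime versions your workspace currently supports, you can query the Clusters API. A minimal sketch, assuming a workspace URL and personal access token (placeholders):

# Minimal sketch: list runtime (Spark) versions the workspace supports.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"                                  # assumption

resp = requests.get(
    f"{DATABRICKS_HOST}/api/2.0/clusters/spark-versions",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for v in resp.json()["versions"]:
    print(v["key"], "-", v["name"])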

Check Auto-Scaling and Node Configurations:

  • Reduce the minimum and maximum worker count to ensure the cluster starts.
  • Example configuration (which can also be applied through the API, as sketched below):

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}
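
The same autoscale settings can be pushed with the Clusters API edit endpoint. Note that edit replaces the whole cluster spec, so required fields must be resent. A minimal sketch; the cluster ID, name, runtime, and node type are placeholder assumptions:

# Minimal sketch: shrink a cluster's autoscale range via /api/2.0/clusters/edit.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"                                  # assumption

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/edit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_id": "<cluster-id>",           # assumption
        "cluster_name": "my-cluster",           # assumption
        "spark_version": "13.3.x-scala2.12",    # assumption: a current LTS runtime
        "node_type_id": "m5.xlarge",            # assumption
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
)
resp.raise_for_status()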

4. Check If DBFS or Storage Configuration Is Causing Issues

Symptoms:

  • Error: “Cluster creation failed: Cannot mount storage.”
  • Cluster fails due to incorrect DBFS mount or cloud storage access errors.

Fix:

Test DBFS access manually:

dbutils.fs.ls("dbfs:/mnt/")

Ensure correct storage credentials are used:

  • AWS S3 mount example:

dbutils.fs.mount(
  source="s3a://my-bucket",
  mount_point="/mnt/s3",
  extra_configs={
    "fs.s3a.access.key": "MY_ACCESS_KEY",
    "fs.s3a.secret.key": "MY_SECRET_KEY"
  }
)

  • Azure Blob Storage (wasbs) mount example:

dbutils.fs.mount(
  source="wasbs://container@storageaccount.blob.core.windows.net",
  mount_point="/mnt/blob",
  extra_configs={
    "fs.azure.account.key.storageaccount.blob.core.windows.net": "STORAGE_KEY"
  }
)

  • Google Cloud Storage mount example:

dbutils.fs.mount(
  source="gs://my-bucket",
  mount_point="/mnt/gcs",
  extra_configs={"fs.gs.auth.service.account.json.keyfile": "gcs-key.json"}
)

Try unmounting and remounting if the issue persists:

dbutils.fs.unmount("/mnt/s3")
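
To avoid "directory already mounted" errors when remounting, check the existing mounts first. A minimal sketch for a Databricks notebook; the mount point, bucket, and credentials are placeholders (in practice, prefer secret scopes or instance profiles over hard-coded keys):

# Minimal sketch: remount a DBFS mount point safely inside a notebook.
MOUNT_POINT = "/mnt/s3"                        # assumption

# Unmount only if the mount point already exists.
if any(m.mountPoint == MOUNT_POINT for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(MOUNT_POINT)

dbutils.fs.mount(
    source="s3a://my-bucket",                  # assumption
    mount_point=MOUNT_POINT,
    extra_configs={
        "fs.s3a.access.key": "MY_ACCESS_KEY",  # assumption: use a secret scope in practice
        "fs.s3a.secret.key": "MY_SECRET_KEY",
    },
)
print(dbutils.fs.ls(MOUNT_POINT))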

5. Increase Cloud Resource Quotas if Needed

Symptoms:

  • Cluster creation fails repeatedly due to quota limits.
  • Error: “Quota exceeded: Cannot create additional instances.”

Fix:

Request a quota increase for compute resources:

  • AWS: aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-1216C47A --desired-value 20 (you can read the current value first; see the boto3 sketch below).
  • Azure: request an increase through the Azure portal (Help + support → Quotas) or with the az quota CLI extension.
  • GCP: check current usage with gcloud compute regions describe us-central1, then request an increase from the Cloud console (IAM & Admin → Quotas).
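
Before filing an increase request, it helps to confirm the current limit. A minimal boto3 sketch for the AWS On-Demand instance quota, using the quota code from the command above (the region is a placeholder):

# Minimal sketch: read the current EC2 On-Demand instance quota with boto3.
import boto3

quotas = boto3.client("service-quotas", region_name="us-east-1")  # assumption: region
resp = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
q = resp["Quota"]
print(f"{q['QuotaName']}: {q['Value']}")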

6. Restart the Cluster and Monitor Logs

Fix:

Check Cluster Event Logs:

  • Go to Databricks UI → Clusters → Event Log.
  • Look for capacity errors, IAM issues, or runtime failures.

Restart the cluster with a lower node count and validate:

databricks clusters restart --cluster-id <cluster-id>
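
The event log can also be pulled programmatically, which is handy for spotting recurring capacity or permission errors. A minimal sketch against the Clusters API; host, token, and cluster ID are placeholders:

# Minimal sketch: fetch recent cluster events via /api/2.0/clusters/events.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # assumption
TOKEN = "<personal-access-token>"                                  # assumption

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": "<cluster-id>", "limit": 25},
)
resp.raise_for_status()
for e in resp.json().get("events", []):
    print(e["timestamp"], e["type"], e.get("details", {}))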

Step-by-Step Troubleshooting Guide

Step 1: Check If Cloud Resources Are Available

  • AWS: aws ec2 describe-instance-type-offerings --region us-east-1
  • Azure: az vm list-sizes --location eastus
  • GCP: gcloud compute regions describe us-central1

Step 2: Verify IAM Permissions

  • AWS: Ensure IAM role has EC2 and S3 access.
  • Azure: Ensure Managed Identity has Contributor role.
  • GCP: Ensure Service Account has Dataproc permissions.

Step 3: Ensure Databricks Runtime and Cluster Settings Are Valid

  • Use a stable Databricks runtime version.
  • Reduce min/max worker count to test.

Step 4: Check Cloud Storage Mounts and DBFS Configurations

dbutils.fs.ls("dbfs:/mnt/")

Step 5: Restart and Monitor Logs

databricks clusters restart --cluster-id <cluster-id>

Best Practices to Avoid Cluster Creation Failures

Use Available Instance Types

  • Always choose widely available instance types in your region.

Enable Auto-Scaling to Prevent Resource Shortages

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  }
}

Check IAM and Cloud Quotas Regularly

  • Ensure IAM roles and service accounts have correct permissions.
  • Monitor and increase cloud quotas if needed.

Conclusion

If Databricks cluster creation fails (CLUSTER001 – Capacity or Config Issue):

  • Check cloud resource availability and switch to a different instance type.
  • Verify IAM roles and permissions for Databricks compute resources.
  • Ensure Databricks runtime and cluster configurations are correct.
  • Fix cloud storage mounting issues if relevant.
  • Restart the cluster and monitor event logs for errors.

By following these steps, you can successfully resolve cluster creation failures and launch your Databricks environment.
