Introduction
If your Databricks cluster fails to start with the error “CLUSTER001 – Cluster creation failed (capacity or config issue)”, it typically indicates insufficient resources, misconfigured settings, or cloud provider limitations.
🚨 Common causes of cluster creation failures:
- Insufficient cloud provider capacity (AWS, Azure, or GCP).
- Invalid instance type or region constraints.
- Misconfigured IAM roles or permissions.
- Incorrect Databricks runtime settings.
This guide covers troubleshooting steps and solutions to successfully start your cluster.
1. Check Cloud Provider Resource Availability
Symptoms:
- Error: “CLUSTER001: Instance type is unavailable in the selected region.”
- Cluster fails due to insufficient compute capacity in the cloud provider.
- AWS: EC2 instance limits exceeded.
- Azure: VM quota reached or SKU unavailable.
- GCP: Insufficient CPU/GPU quota.
Fix:
✅ Reduce cluster size or switch to an available instance type:
- AWS:
aws ec2 describe-instance-type-offerings --location-type region --region us-east-1
- Try a different instance type (e.g., switch from m5.xlarge to m5.2xlarge); availability varies by type and region.
- Azure:
az vm list-sizes --location eastus
- Check current usage against your quota, and request an increase if the VM quota is exceeded (via the Azure portal under Subscriptions → Usage + quotas):
az vm list-usage --location eastus --output table
- GCP:
gcloud compute regions describe us-central1
- Switch to another region or use a different machine type.
✅ Change cluster configuration in Databricks:
- Go to Databricks UI → Clusters → Edit Cluster.
- Change Instance Type to a smaller or more available machine.
- Select a different cloud region if capacity issues persist.
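If you want to script the availability check above rather than run it by hand, a minimal boto3 sketch can confirm which candidate instance types are offered in a region before you edit the cluster config (AWS only; the helper name and candidate list are illustrative):
import boto3

def offered_instance_types(region, candidates):
    """Return the subset of candidate instance types offered in `region`."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_type_offerings(
        LocationType="region",
        Filters=[{"Name": "instance-type", "Values": candidates}],
    )
    return sorted(o["InstanceType"] for o in resp["InstanceTypeOfferings"])

# Example: check a few fallback candidates before picking a node type.
print(offered_instance_types("us-east-1", ["m5.xlarge", "m5.2xlarge", "m5a.xlarge"]))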
2. Verify IAM Roles and Permissions
Symptoms:
- Cluster fails with permission-related errors.
- Error: “Permission denied: Cannot create cluster.”
Fix:
✅ Ensure IAM roles have required permissions for Databricks cluster creation.
AWS IAM Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:DescribeInstances",
        "ec2:CreateTags",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": "*"
    }
  ]
}
Azure Databricks Managed Identity:
az role assignment create --assignee <service-principal-id> --role "Contributor" --scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<workspace-name>
GCP IAM Policy for Databricks Service Account (Databricks on GCP runs on Compute Engine/GKE, so the service account needs compute permissions, not Dataproc roles):
gcloud projects add-iam-policy-binding <project-id> --member="serviceAccount:databricks-sa@<project-id>.iam.gserviceaccount.com" --role="roles/compute.admin"
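On AWS, you can verify that the role actually allows these actions without launching anything, using the IAM policy simulator. A hedged boto3 sketch (the role ARN is a placeholder; point it at your Databricks cross-account or instance-profile role):
import boto3

iam = boto3.client("iam")
result = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::<account-id>:role/<databricks-role>",
    ActionNames=["ec2:RunInstances", "ec2:DescribeInstances", "s3:GetObject"],
)
for r in result["EvaluationResults"]:
    print(r["EvalActionName"], "->", r["EvalDecision"])  # expect "allowed"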
3. Validate Databricks Runtime and Cluster Configurations
Symptoms:
- Error: “Invalid cluster configuration.”
- Cluster fails due to an unsupported Databricks runtime version.
Fix:
✅ Ensure the Databricks runtime is supported:
- Go to Databricks UI → Clusters → Edit Cluster.
- Select a stable Databricks runtime (avoid preview or unsupported versions).
- For GPU clusters, select a GPU-enabled ML runtime, which bundles the matching CUDA drivers.
✅ Check Auto-Scaling and Node Configurations:
- Reduce the minimum and maximum worker count to ensure the cluster starts.
- Example configuration:
{ "autoscale": { "min_workers": 2, "max_workers": 8 } }
4. Check If DBFS or Storage Configuration Is Causing Issues
Symptoms:
- Error: “Cluster creation failed: Cannot mount storage.”
- Cluster fails due to incorrect DBFS mount or cloud storage access errors.
Fix:
✅ Test DBFS access manually:
dbutils.fs.ls("dbfs:/mnt/")
✅ Ensure correct storage credentials are used:
- AWS S3 Mount Example:
dbutils.fs.mount(
  source="s3a://my-bucket",
  mount_point="/mnt/s3",
  extra_configs={
    "fs.s3a.access.key": "MY_ACCESS_KEY",
    "fs.s3a.secret.key": "MY_SECRET_KEY"
  }
)
- Azure Blob Storage Mount Example (for ADLS Gen2, use an abfss:// source URI instead):
dbutils.fs.mount(
  source="wasbs://container@storageaccount.blob.core.windows.net",
  mount_point="/mnt/azure",
  extra_configs={
    "fs.azure.account.key.storageaccount.blob.core.windows.net": "STORAGE_KEY"
  }
)
- Google Cloud Storage Mount Example:
dbutils.fs.mount(
  source="gs://my-bucket",
  mount_point="/mnt/gcs",
  extra_configs={
    "fs.gs.auth.service.account.json.keyfile": "gcs-key.json"
  }
)
✅ Try unmounting and remounting if the issue persists:
dbutils.fs.unmount("/mnt/s3")
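A defensive pattern is to unmount only when the mount point already exists, then remount and verify. A sketch using the S3 example above (bucket, keys, and mount point are illustrative):
mount_point = "/mnt/s3"

# Unmount only if this mount point is already registered.
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)

dbutils.fs.mount(
    source="s3a://my-bucket",
    mount_point=mount_point,
    extra_configs={
        "fs.s3a.access.key": "MY_ACCESS_KEY",
        "fs.s3a.secret.key": "MY_SECRET_KEY",
    },
)
display(dbutils.fs.ls(mount_point))  # verify the mount is readable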
5. Increase Cloud Resource Quotas if Needed
Symptoms:
- Cluster creation fails repeatedly due to quota limits.
- Error: “Quota exceeded: Cannot create additional instances.”
Fix:
✅ Request a quota increase for compute resources:
- AWS:
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-1216C47A --desired-value 20
- Azure (check current usage, then request an increase via the Azure portal under Subscriptions → Usage + quotas):
az vm list-usage --location eastus --output table
- GCP (check regional quotas, then request an increase in the Cloud Console under IAM & Admin → Quotas):
gcloud compute regions describe us-central1
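To check the current AWS value before filing a request, a minimal boto3 sketch (this uses the same On-Demand Standard-instances quota code as the command above):
import boto3

sq = boto3.client("service-quotas", region_name="us-east-1")
quota = sq.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
print(quota["Quota"]["QuotaName"], "=", quota["Quota"]["Value"])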
6. Restart the Cluster and Monitor Logs
Fix:
✅ Check Cluster Event Logs:
- Go to Databricks UI → Clusters → Event Log
- Look for capacity errors, IAM issues, or runtime failures.
✅ Restart the cluster with a lower node count and validate:
databricks clusters restart --cluster-id <cluster-id>
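The same event log is also available over the REST API, which is handy for spotting capacity or permission failures without clicking through the UI. A short sketch (host, token, and cluster ID are placeholders):
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

resp = requests.post(
    f"{host}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": "<cluster-id>", "limit": 25},
)
for event in resp.json().get("events", []):
    print(event["timestamp"], event["type"], event.get("details", {}))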
Step-by-Step Troubleshooting Guide
Step 1: Check If Cloud Resources Are Available
- AWS:
aws ec2 describe-instance-type-offerings --region us-east-1
- Azure:
az vm list-sizes --location eastus
- GCP:
gcloud compute regions describe us-central1
Step 2: Verify IAM Permissions
- AWS: Ensure IAM role has EC2 and S3 access.
- Azure: Ensure Managed Identity has Contributor role.
- GCP: Ensure the Service Account has the required compute permissions.
Step 3: Ensure Databricks Runtime and Cluster Settings Are Valid
- Use a stable Databricks runtime version.
- Reduce min/max worker count to test.
Step 4: Check Cloud Storage Mounts and DBFS Configurations
dbutils.fs.ls("dbfs:/mnt/")
Step 5: Restart and Monitor Logs
databricks clusters restart --cluster-id <cluster-id>
Best Practices to Avoid Cluster Creation Failures
✅ Use Available Instance Types
- Always choose widely available instance types in your region.
✅ Enable Auto-Scaling to Prevent Resource Shortages
{
"autoscale": {
"min_workers": 2,
"max_workers": 10
}
}
✅ Check IAM and Cloud Quotas Regularly
- Ensure IAM roles and service accounts have correct permissions.
- Monitor and increase cloud quotas if needed.
Conclusion
If Databricks cluster creation fails (CLUSTER001 – Capacity or Config Issue):
✅ Check cloud resource availability and switch to a different instance type.
✅ Verify IAM roles and permissions for Databricks compute resources.
✅ Ensure Databricks runtime and cluster configurations are correct.
✅ Fix cloud storage mounting issues if relevant.
✅ Restart the cluster and monitor event logs for errors.
By following these steps, you can successfully resolve cluster creation failures and launch your Databricks environment.