AUTOSCALE001 – Autoscaling Failure (Not Enough Capacity) in Databricks

Introduction

The AUTOSCALE001 error in Databricks indicates that autoscaling failed due to insufficient cloud capacity. This happens when Databricks tries to add worker nodes to an autoscaling cluster, but AWS, Azure, or GCP cannot provision additional instances due to:

  • Cloud provider capacity limits in the selected region.
  • Quota restrictions on the Databricks account.
  • Instance type availability issues (e.g., spot instances unavailable).
  • Network or permissions misconfigurations preventing scale-up.

🚨 Common symptoms of AUTOSCALE001:

  • Cluster fails to scale up beyond a certain number of workers.
  • Job execution slows down or gets stuck due to limited compute.
  • Intermittent autoscaling failures (scale-up succeeds sometimes but not always).

This guide explores troubleshooting steps and best practices to resolve AUTOSCALE001 errors in Databricks.


1. Check Cloud Provider Capacity and Quotas

Symptoms:

  • Autoscaling fails to add workers, even though they are needed.
  • Error: “Cloud provider does not have enough capacity for the requested instance type.”
  • AWS EC2, Azure VM, or GCP Compute instances cannot be allocated.

Causes:

  • Cloud provider region has reached capacity limits for the requested instance type.
  • Databricks account has hit quota limits for VMs or CPUs.
  • Certain instance types (e.g., spot instances) are unavailable.

Fix:

Check available cloud capacity:

  • AWS:
aws ec2 describe-instance-type-offerings --region us-east-1 --location-type availability-zone
  • Azure:
az vm list-usage --location eastus
  • GCP:
gcloud compute regions describe us-central1
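
To check a specific instance type on AWS, the same command can be narrowed with a filter (r5.xlarge is only an example; substitute the worker type your cluster requests):

aws ec2 describe-instance-type-offerings --region us-east-1 --location-type availability-zone --filters "Name=instance-type,Values=r5.xlarge"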

Switch to a different instance type (if available):

  • Edit cluster configuration → Worker node type
  • Choose a different VM size or family (e.g., Standard_D3_v2 instead of Standard_D2_v2).
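
To see which node types the workspace can actually provision, they can be listed with the Databricks CLI (a minimal check, assuming the CLI is installed and authenticated against the workspace):

databricks clusters list-node-types

Pick a replacement worker type from this list before editing the cluster.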

Increase cloud quota for your account:

  • AWS: Request a quota increase in the AWS Service Quotas console.
  • Azure: Request quota increase in Azure Subscription Limits.
  • GCP: Increase CPU limits in Google Cloud Console → Quotas.
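
On AWS, the relevant vCPU quotas can be inspected and an increase requested from the CLI. This is a sketch; the quota code shown is the “Running On-Demand Standard instances” limit and may not match your instance family, so confirm the code from the list first:

# Find the quota code for your instance family
aws service-quotas list-service-quotas --service-code ec2

# Request an increase (quota code and desired value are examples)
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-1216C47A --desired-value 256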

If using spot instances, allow fallback to on-demand:

  • Modify cluster worker configuration to allow on-demand instances.
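
On AWS this is expressed in the cluster’s aws_attributes. A minimal sketch (values are illustrative; first_on_demand keeps the driver and the first worker on on-demand capacity so the cluster can always start):

"aws_attributes": {
  "availability": "SPOT_WITH_FALLBACK",
  "first_on_demand": 1,
  "spot_bid_price_percent": 100
}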

2. Verify Databricks Cluster Autoscaling Limits

Symptoms:

  • Autoscaling stops before reaching expected max workers.
  • Job execution is slow despite autoscaling being enabled.
  • Cluster does not scale beyond a fixed number of nodes.

Causes:

  • Max worker nodes in the cluster settings are too low.
  • Cluster policies restrict autoscaling.
  • Databricks workspace-wide compute limits are reached.

Fix:

Check the max worker nodes limit:

  • Go to Databricks UI → Clusters → Edit Cluster → Autoscaling Settings.
  • Raise “Maximum Workers” if the current limit is lower than the workload requires.
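
The same change can be made programmatically. A sketch using the Databricks CLI (cluster ID, Spark version, and node type are placeholders, and it assumes a CLI version that supports clusters edit --json; note that edit replaces the full cluster spec, so include every field you want to keep):

databricks clusters edit --json '{
  "cluster_id": "<cluster-id>",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": { "min_workers": 2, "max_workers": 16 }
}'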

Check Databricks workspace-wide compute limits:

  • Go to Admin Console → Compute Settings and ensure the workspace has enough compute resources.

Ensure cluster policies allow autoscaling:

databricks cluster-policies list
  • If a policy restricts autoscaling, modify or remove it.
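
A policy that caps autoscaling will contain rules on the autoscale attributes in its definition. A fragment allowing autoscaling up to 20 workers might look like this (values are examples, not a recommendation):

{
  "autoscale.min_workers": { "type": "range", "minValue": 1 },
  "autoscale.max_workers": { "type": "range", "maxValue": 20, "defaultValue": 8 }
}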

Monitor cluster scaling events:

  • Open the cluster’s Event log tab in the Databricks UI and check whether resize requests are failing or being rejected.

3. Troubleshoot Instance Type Availability (AWS, Azure, GCP)

Symptoms:

  • Certain instance types (e.g., r5.xlarge) cannot be provisioned.
  • AWS Spot instances frequently fail to start.
  • Azure VM scaling fails for specific SKU types.

Causes:

  • Cloud provider does not have enough available instances in the requested region.
  • Spot instance pool is exhausted, preventing scale-up.
  • Instance type is restricted in the cloud provider’s availability zone.

Fix:

Switch to a different instance type:

  • Try changing from r5.xlarge to c5.4xlarge or a similar alternative.

For AWS, let Databricks choose an availability zone that still has capacity by enabling automatic availability zone selection (set the cluster’s availability zone to auto) instead of pinning a single AZ.
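
A minimal sketch of the corresponding cluster attribute (this assumes your workspace supports automatic AZ selection):

"aws_attributes": { "zone_id": "auto" }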

For Azure, choose a different VM SKU:

az vm list-sizes --location eastus

For GCP, check and switch to an available machine type:

gcloud compute machine-types list --filter="zone:us-central1-a"

If using spot instances, allow fallback to on-demand instances (see the aws_attributes example in section 1).


4. Check Network and Permissions Configuration

Symptoms:

  • Cluster fails to scale beyond a certain point, despite capacity availability.
  • Scaling requests are rejected due to networking errors.
  • Error: “Failed to provision instances due to VPC restrictions.”

Causes:

  • VPC or subnet IP exhaustion prevents new instances from being allocated.
  • Security group restrictions block communication between worker nodes.
  • IAM permissions prevent instance creation.

Fix:

Check available IPs in the subnet:

aws ec2 describe-subnets --query "Subnets[*].AvailableIpAddressCount"
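
The query above returns only the counts; to see which subnet each count belongs to, a slightly richer sketch can be used (the VPC ID filter is a placeholder):

aws ec2 describe-subnets --filters "Name=vpc-id,Values=<vpc-id>" --query "Subnets[*].{SubnetId:SubnetId,FreeIPs:AvailableIpAddressCount}" --output table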

Expand the subnet or use a different VPC if necessary.

Verify security group rules allow worker node communication:

aws ec2 describe-security-groups --group-ids sg-12345678

Ensure the IAM role used by Databricks allows instance creation and tagging, for example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:RunInstances", "ec2:CreateTags"],
      "Resource": "*"
    }
  ]
}
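
To confirm the role Databricks uses actually has these permissions, IAM’s policy simulator can be queried from the CLI. A sketch (the role ARN is a placeholder for your workspace’s cross-account role):

aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<account-id>:role/<databricks-cross-account-role> \
  --action-names ec2:RunInstances ec2:CreateTags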

5. Review Databricks Autoscaling Logs and Metrics

Symptoms:

  • Cluster autoscaling stops unexpectedly.
  • No clear reason for failure in UI.

Fix:

Review the cluster event log:

  • Open the cluster’s Event log tab in the Databricks UI and check for failed resize events and capacity-related messages.

Run Databricks diagnostic commands:

databricks clusters get --cluster-id <cluster-id>
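
The cluster’s event history is also available from the Clusters API, which is convenient for scripted checks. A sketch using curl (workspace URL, token, and cluster ID are placeholders):

curl -X POST https://<databricks-instance>/api/2.0/clusters/events \
  -H "Authorization: Bearer <personal-access-token>" \
  -d '{ "cluster_id": "<cluster-id>", "limit": 25 }'

Look for resize and capacity-related events around the time the job slowed down.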

Step-by-Step Troubleshooting Guide

1. Check Cloud Provider Capacity and Quotas

aws ec2 describe-instance-type-offerings --region us-east-1
az vm list-usage --location eastus
gcloud compute regions describe us-central1
  • If capacity is exhausted, switch to a different instance type.

2. Verify Autoscaling Limits in Cluster Settings

  • Increase Maximum Workers in Databricks cluster settings.

3. Check Network and IAM Permissions

aws ec2 describe-subnets --query "Subnets[*].AvailableIpAddressCount"
  • Ensure the VPC has enough IPs available for scaling.

4. Enable Logs and Monitoring for Further Debugging

databricks clusters get --cluster-id <cluster-id>

Best Practices to Avoid AUTOSCALE001 Errors

Use Multiple Instance Types for Autoscaling

  • Add multiple instance families to prevent scaling failures.

Use Fallback to On-Demand Instances for Spot Clusters

  • Enable on-demand fallback to prevent spot interruptions.

Monitor and Increase Cloud Quotas Proactively

  • Regularly check quota limits to avoid unexpected failures.

Optimize Cluster Autoscaling Settings

  • Set realistic min/max workers based on job workload.

Ensure VPC/Subnet Has Enough IP Addresses for Scaling

  • Avoid IP exhaustion blocking new worker nodes.

Conclusion

If AUTOSCALE001 – Autoscaling failure (not enough capacity) occurs in Databricks:

  • Check cloud provider capacity and quota limits.
  • Ensure correct autoscaling settings in cluster configurations.
  • Switch to a different instance type if necessary.
  • Verify IAM, VPC, and networking configurations.
  • Use logs and metrics to diagnose and resolve failures.

By following these steps, you can resolve autoscaling failures and ensure smooth scaling of Databricks clusters.
