Here’s a list of 20 common Databricks issues users face, with additional context and examples for clarity:
1. Cluster Launch Failures
- Causes: Cloud provider quotas (e.g., AWS/Azure instance limits), misconfigured network settings (VPC, subnets), or invalid Spark configurations.
- Fix: Check cloud provider quotas, validate network settings, and review cluster configurations.
2. Slow Query Performance
- Causes: Poorly optimized Spark SQL, lack of caching, or small file problems (e.g., too many tiny Parquet/Delta files).
- Example: A `SELECT *` query on a massive Delta table without predicate filters.
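A minimal sketch of the fix, assuming a hypothetical `sales.orders` Delta table: select only needed columns, filter on a partition or predicate column so files can be skipped, and compact small files.

```python
# Instead of scanning every column and file of a large Delta table...
# df = spark.sql("SELECT * FROM sales.orders")

# ...project only the columns you need and filter so Delta can skip files.
df = (spark.table("sales.orders")
      .select("order_id", "customer_id", "amount")
      .filter("order_date >= '2024-01-01'"))

# Compact small files to reduce per-file overhead on future reads.
spark.sql("OPTIMIZE sales.orders")
```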
3. Data Skew in Spark Jobs
- Signs: A few tasks take significantly longer than others.
- Fix: Use salting, repartitioning, or adaptive query execution (AQE) in Spark 3.0+.
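A short sketch of both approaches, using hypothetical `facts` and `dims` DataFrames joined on `customer_id`:

```python
from pyspark.sql import functions as F

# Enable AQE and its skew-join handling (Spark 3.0+).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Manual salting for a badly skewed join key: spread hot keys across 16 salts.
salted_facts = facts.withColumn("salt", (F.rand() * 16).cast("int"))
salted_dims = dims.crossJoin(spark.range(16).withColumnRenamed("id", "salt"))
joined = salted_facts.join(salted_dims, ["customer_id", "salt"])
```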
4. Out-of-Memory (OOM) Errors
- Common In: Executors or driver nodes due to insufficient `spark.executor.memory` or `spark.driver.memory` settings.
- Fix: Tune memory settings or increase instance sizes.
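For example, the memory-related settings can be raised in the cluster’s Spark config (illustrative values; these are startup settings, so they must be set when the cluster is created, not at runtime):

```
spark.executor.memory 16g
spark.driver.memory 16g
spark.driver.maxResultSize 8g
```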
5. Cost Overruns
- Causes: Overprovisioned clusters (e.g., large instances for small workloads), idle clusters left running, or frequent autoscaling.
- Tip: Use job clusters instead of all-purpose clusters for scheduled workloads.
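A hedged sketch of a Jobs API 2.1 payload that runs a notebook on an ephemeral job cluster, which terminates when the run ends; the name, notebook path, Spark version, and node type are illustrative placeholders:

```python
# Sketch of a job definition with an ephemeral job cluster.
job_spec = {
    "name": "nightly-etl",
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {"notebook_path": "/Repos/etl/nightly"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}
# POST job_spec to /api/2.1/jobs/create on your workspace with a valid token.
```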
6. Delta Lake Merge/Update Failures
- Issues: Conflicts during `MERGE` operations, schema mismatches, or transaction log corruption.
- Fix: Use optimistic concurrency control and validate schemas before merging.
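A minimal upsert sketch with the Delta Lake Python API, assuming a hypothetical `sales.orders` table and an `updates` DataFrame whose schema has already been validated against the target:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "sales.orders")
(target.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")  # join on the key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```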
7. Notebook Dependency Conflicts
- Example: Mixing incompatible Python libraries (e.g., Pandas vs. PySpark versions).
- Fix: Use cluster-scoped init scripts or MLflow for environment isolation.
8. Metastore Connectivity Issues
- Errors: Hive metastore (AWS Glue/Azure SQL) timeouts or permission issues.
- Root Cause: Network ACLs, security group rules, or IAM roles misconfigured.
9. Autoscaling Inefficiency
- Problem: Clusters not scaling up/down as expected due to uneven workloads or misconfigured scaling policies.
- Fix: Tune the cluster’s autoscale range (min/max workers) or use instance pools.
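A small cluster-spec fragment (Clusters/Jobs API) showing the autoscale range and an instance pool reference; the pool ID and worker counts are placeholders:

```python
cluster_conf = {
    "autoscale": {"min_workers": 2, "max_workers": 8},  # bound scale-up/down
    "instance_pool_id": "pool-1234567890abcdef",        # faster node acquisition
}
```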
10. Timeout Errors in Streaming Jobs
- Example: Structured Streaming jobs failing due to a `TimeoutException` in Kafka or Event Hubs.
- Fix: Increase timeout thresholds or optimize micro-batch processing.
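A sketch for a Kafka source with bounded micro-batches and raised client timeouts; the broker address, topic, checkpoint path, and target table are placeholders:

```python
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", 10000)         # cap records per micro-batch
      .option("kafka.request.timeout.ms", "120000")  # raise consumer timeouts
      .option("kafka.session.timeout.ms", "60000")
      .load())

query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/checkpoints/events")
         .trigger(processingTime="1 minute")
         .toTable("events_bronze"))
```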
11. Access Control Issues
- Scenarios: Users unable to access notebooks, data, or clusters due to misconfigured Databricks workspace permissions or Unity Catalog policies.
- Tip: Use granular access controls in Unity Catalog (for data governance).
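For example, Unity Catalog grants can be issued from a notebook; the catalog, schema, table, and group names below are placeholders:

```python
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_engineers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_engineers`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_engineers`")
```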
12. DBFS Mount Failures
- Causes: Invalid credentials for cloud storage (S3, ADLS) or expired SAS tokens.
- Fix: Re-mount storage with updated credentials and validate IAM roles.
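A re-mount sketch for an S3 bucket, assuming the cluster already has an instance profile with access to it; the bucket name and mount point are placeholders:

```python
# Drop the stale mount if it exists, then mount again.
if any(m.mountPoint == "/mnt/raw" for m in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/raw")

dbutils.fs.mount(source="s3a://my-company-raw-bucket", mount_point="/mnt/raw")
display(dbutils.fs.ls("/mnt/raw"))   # verify the mount works
```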
13. Job Failures with External Libraries
- Issue: Jobs failing due to missing JARs/Python wheels or version conflicts.
- Fix: Use cluster-level libraries or install dependencies via init scripts.
14. Schema Evolution Problems
- Example: Adding columns to Delta tables causing downstream job failures.
- Tip: Use `mergeSchema=true` or enable automatic schema evolution.
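A brief sketch of both options, assuming a hypothetical `new_data` DataFrame and `sales.orders` Delta table:

```python
# Append with mergeSchema so new columns are added instead of failing the write.
(new_data.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("sales.orders"))

# Or enable automatic schema evolution for the session (also affects MERGE).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```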
15. Driver Node Crashes
- Causes: Heavy operations (e.g., `.collect()`) on the driver, exceeding driver memory.
- Fix: Avoid collecting large datasets to the driver; offload work to executors.
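A short illustration of the pattern, with hypothetical table names:

```python
# Instead of pulling an entire table onto the driver...
# rows = spark.table("sales.orders").collect()   # risks driver OOM

# ...aggregate on the executors and bring back only small, bounded results.
daily_totals = (spark.table("sales.orders")
                .groupBy("order_date")
                .sum("amount"))
preview = daily_totals.limit(100).toPandas()      # small preview only
daily_totals.write.mode("overwrite").saveAsTable("sales.daily_totals")
```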
16. Cloud Storage Latency
- Problem: Slow reads/writes to S3, ADLS, or GCS due to eventual consistency or throttling.
- Fix: Use optimized connectors (e.g., `s3a://` with AWS) or caching.
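Two caching options as a sketch, with an illustrative table and filter:

```python
# Enable the Databricks disk cache for faster repeated reads of remote data.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Or cache a frequently reused DataFrame explicitly.
hot = spark.table("sales.orders").filter("order_date >= '2024-01-01'").cache()
hot.count()   # materialize the cache
```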
17. Inconsistent Time Zones
- Example: Timestamps in notebooks showing UTC vs. local time discrepancies.
- Fix: Set `spark.sql.session.timeZone` in cluster configurations.
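For example, in a notebook (the time zone value is illustrative):

```python
spark.conf.set("spark.sql.session.timeZone", "America/New_York")
spark.sql("SELECT current_timestamp()").show()   # now rendered in the session zone
```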
18. Delta Table Vacuum Issues
- Error: `Cannot vacuum, files in use` due to retention period conflicts.
- Solution: Adjust `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration`.
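A sketch of adjusting retention and then vacuuming; the table name and durations are illustrative and should match how far back you need time travel:

```python
spark.sql("""
  ALTER TABLE sales.orders SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
  )
""")
spark.sql("VACUUM sales.orders RETAIN 168 HOURS")  # 7 days
```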
19. API Rate Limiting
- Scenario: Automation scripts hitting Databricks REST API rate limits.
- Fix: Implement retry logic with exponential backoff.
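A minimal retry-with-backoff sketch against the Databricks REST API; the workspace URL, token, and endpoint are placeholders:

```python
import time
import requests

def call_with_backoff(url, token, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
        if resp.status_code != 429:        # 429 = rate limited
            resp.raise_for_status()
            return resp.json()
        time.sleep(2 ** attempt)           # exponential backoff before retrying
    raise RuntimeError("Still rate limited after retries")

jobs = call_with_backoff("https://<workspace-url>/api/2.1/jobs/list", "<token>")
```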
20. Secret Management Problems
- Example: Hardcoded credentials in notebooks or failed secret scoping via Databricks Secrets API.
- Best Practice: Use Databricks Secrets with Azure Key Vault/AWS Secrets Manager integration.
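For instance, credentials can be read from a secret scope instead of being hardcoded; the scope, key, and connection details below are placeholders:

```python
# Secret scope backed by Azure Key Vault or AWS Secrets Manager.
jdbc_password = dbutils.secrets.get(scope="prod-secrets", key="warehouse-password")

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/warehouse")
      .option("user", "analytics")
      .option("password", jdbc_password)   # never printed or stored in the notebook
      .option("dbtable", "public.orders")
      .load())
```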