Introduction
The SPARK005 – Too Many Open Files error in Databricks indicates that the number of open file descriptors has exceeded the system’s limit. This issue can slow down workloads, cause job failures, and lead to out-of-memory (OOM) errors.
🚨 Common symptoms of SPARK005 (Too Many Open Files):
- Job failures due to too many open files.
- Slow or stalled Spark queries.
- High file descriptor usage, with "Too many open files" messages in the logs.
- File system errors when reading or writing data.
This guide explains the root causes of SPARK005, how to troubleshoot it, and best practices to prevent it in Databricks.
Common Causes of SPARK005 in Databricks
| Cause | Impact |
|---|---|
| Too many small files in cloud storage | File descriptor limits exceeded |
| High shuffle operations with many partitions | Too many open file handles |
| Inefficient data format (e.g., CSV instead of Parquet/Delta) | Increased file system load |
| Overloaded driver node | Driver crashes due to the open file limit |
| Databricks cluster file limit reached | Cluster becomes unresponsive |
How to Fix SPARK005 – Too Many Open Files in Databricks
1. Identify the Number of Open Files in the Cluster
Run the following command in a %sh notebook cell to check file descriptor usage:
ls /proc/$(cat /var/run/spark/spark-*.pid)/fd | wc -l
This will show how many file descriptors Spark is using.
To check the system-wide file limit, run:
ulimit -n
Typical values:
- Soft limit: 1024
- Hard limit: 65535 (recommended for Databricks workloads)
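You can also read the limits from a Python notebook cell; a minimal sketch using the standard resource module (note it reports the limits of the notebook's Python process, which may differ from those of the Spark JVM):
import resource

# Query the soft and hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"Soft limit: {soft}, hard limit: {hard}")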
2. Increase the File Descriptor Limit for Databricks Clusters
✅ Increase the ulimit settings for file descriptors by modifying the cluster startup script:
- Go to Databricks UI → Compute → Edit Cluster
- Add the following script under Init Scripts:
#!/bin/bash
ulimit -n 65535
- Restart the cluster to apply changes.
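If you prefer to create the init script programmatically, a minimal sketch using dbutils.fs.put is shown below; the DBFS path is a hypothetical example, and the script must still be registered under the cluster's Init Scripts setting:
# Write the init script to DBFS (hypothetical path; register it on the cluster afterwards)
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/set-ulimit.sh",
    "#!/bin/bash\nulimit -n 65535\n",
    overwrite=True,
)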
✅ You can also add Spark configurations that reduce resource pressure on the driver and executors (note that these do not raise the OS file descriptor limit itself):
{
"spark.driver.maxResultSize": "4g",
"spark.worker.cleanup.enabled": "true",
"spark.executor.extraJavaOptions": "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
}
3. Optimize Data Formats and Reduce the Number of Open Files
🚀 Switch from CSV to Parquet or Delta to reduce file count and improve performance.
❌ Bad (Too many small files in CSV format):
df.write.format("csv").save("s3://mybucket/data/")
✅ Good (Optimized for fewer open files):
df.write.format("parquet").save("s3://mybucket/data/")
✅ Best (Use Delta for better metadata handling):
df.write.format("delta").save("s3://mybucket/delta_table/")
4. Reduce the Number of Partitions in Spark Jobs
🔧 Too many partitions = too many open files.
✅ Reduce the number of output files using coalesce() or repartition().
df = df.repartition(10) # Reduces the number of open file handles
✅ Reduce shuffle partitions for large jobs
{
"spark.sql.shuffle.partitions": "200"
}
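The same setting can also be applied at session level from a notebook; a minimal sketch (200 is Spark's default and only a starting point, so tune it for your workload):
# Fewer shuffle partitions means fewer shuffle files, and thus fewer open file handles
spark.conf.set("spark.sql.shuffle.partitions", "200")

# coalesce() reduces partitions without triggering a full shuffle
df = df.coalesce(10)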
5. Enable Auto-Compaction for Delta Tables
Delta tables can accumulate too many small files, causing excessive open files.
✅ Enable optimized writes and auto-compaction for Delta tables:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
✅ Manually optimize small files in Delta tables:
OPTIMIZE delta.`s3://mybucket/delta_table/` ZORDER BY (column_name);
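The same compaction can be triggered from Python; a rough sketch of the SQL OPTIMIZE above, assuming the Delta Lake Python API (delta-spark 2.0+) is available on the cluster:
from delta.tables import DeltaTable

# Compact small files and co-locate rows by the Z-order column
delta_table = DeltaTable.forPath(spark, "s3://mybucket/delta_table/")
delta_table.optimize().executeZOrderBy("column_name")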
6. Use Broadcast Joins to Reduce Shuffle File Overhead
🚀 If joining large tables, enable broadcast joins to prevent excessive shuffle file creation.
✅ Enable Auto Broadcast Joins in Spark:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
✅ Manually broadcast smaller DataFrames:
from pyspark.sql.functions import broadcast

# Hint Spark to replicate the smaller DataFrame to every executor
df_small = broadcast(df_small)
df_large.join(df_small, "id")
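To confirm that Spark actually picked the broadcast strategy, inspect the physical plan; a minimal sketch reusing df_small and df_large from the example above:
from pyspark.sql.functions import broadcast

joined = df_large.join(broadcast(df_small), "id")

# The plan should show BroadcastHashJoin rather than SortMergeJoin
joined.explain()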
7. Monitor Open File Handles and Cluster Logs
✅ Check active Spark jobs for excessive open files:
- Go to the Spark UI → Executors tab
- Look for executors with failing tasks or heavy shuffle read/write, which often accompany file descriptor exhaustion
✅ Log file descriptor usage over time, for example by checking the kernel's open file handle counters on the driver node:
cat /proc/sys/fs/file-nr
✅ Check cluster logs for file descriptor warnings:
cat /var/log/spark/spark-*.log | grep "Too many open files"
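To log descriptor counts periodically, you can also count entries under /proc/<pid>/fd from a notebook; a minimal sketch that watches the notebook's own Python process (Linux /proc is assumed, and the three-sample loop is purely illustrative):
import os
import time

def open_fd_count(pid: int = os.getpid()) -> int:
    """Count open file descriptors for a process via /proc."""
    return len(os.listdir(f"/proc/{pid}/fd"))

for _ in range(3):  # illustrative: take three samples
    print(f"Open file descriptors: {open_fd_count()}")
    time.sleep(10)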
Best Practices to Prevent SPARK005 (Too Many Open Files)
✅ Optimize File Formats
- Use Parquet or Delta Lake instead of CSV.
- Enable Auto-Compaction for Delta tables.
✅ Manage File Descriptor Limits
- Increase ulimit -n to 65535.
- Ensure the Spark cluster has enough resources.
✅ Reduce Partitions and Open Files
- Use repartition(10) instead of writing excessive small files.
- Reduce shuffle partitions (spark.sql.shuffle.partitions=200).
✅ Monitor and Debug File Usage
- Use the Spark UI Executors tab.
- Log file descriptor counts periodically.
Conclusion
SPARK005 – Too Many Open Files occurs when Databricks workloads exceed file descriptor limits, causing job failures and slowdowns.
By optimizing file formats, reducing small files, tuning partitions, and increasing ulimit settings, you can prevent file descriptor exhaustion and improve Databricks job performance.