
SPARK005 – Too Many Open Files (Exceeding File Limit) in Databricks

Introduction

The SPARK005 – Too Many Open Files error in Databricks indicates that the number of open file descriptors has exceeded the operating system's limit. This can slow down workloads, cause job failures, and, in severe cases, contribute to out-of-memory (OOM) errors.

🚨 Common symptoms of SPARK005 (Too Many Open Files):

  • Job failures due to too many open files.
  • Slow or stalled Spark queries.
  • High file descriptor usage in logs (Too many open files).
  • File system errors when reading or writing data.

This guide explains the root causes of SPARK005, how to troubleshoot it, and best practices to prevent it in Databricks.


Common Causes of SPARK005 in Databricks

Cause                                                        | Impact
-------------------------------------------------------------|---------------------------------------
Too many small files in cloud storage                        | File descriptor limits exceeded
High shuffle operations with many partitions                 | Too many open file handles
Inefficient data format (e.g., CSV instead of Parquet/Delta) | Increased file system load
Overloaded driver node                                       | Driver crashes due to open file limit
Databricks cluster file limit reached                        | Cluster becomes unresponsive

How to Fix SPARK005 – Too Many Open Files in Databricks

1. Identify the Number of Open Files in the Cluster

Run the following shell command (for example, in a %sh notebook cell) to check how many file descriptors the Spark driver process is using; the PID file location may vary by Databricks Runtime version:

ls /proc/$(cat /var/run/spark/spark-*.pid)/fd | wc -l

This prints the number of file descriptors the Spark process currently has open.

To check the file descriptor limit for the current shell, run:

ulimit -n

Typical values:

  • Soft limit: 1024
  • Hard limit: 65535 (recommended for Databricks workloads)
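You can also read the same information from a notebook using Python's standard library; a minimal sketch (Linux only, and it reports the driver process, not the executors):

import resource
import subprocess

# Soft/hard file descriptor limits for the current (driver) process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"Soft limit: {soft}, hard limit: {hard}")

# Count the open descriptors of this process via /proc (Linux only)
open_fds = int(subprocess.check_output("ls /proc/self/fd | wc -l", shell=True).decode().strip())
print(f"Open file descriptors in this process: {open_fds}")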

2. Increase the File Descriptor Limit for Databricks Clusters

Increase the ulimit setting for file descriptors by adding a cluster-scoped init script:

  1. Go to Databricks UI → Compute → Edit Cluster
  2. Add the following script under Init Scripts:
#!/bin/bash
ulimit -n 65535
  3. Restart the cluster to apply changes.
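If you prefer to manage the init script from a notebook, one option is to write it to DBFS with dbutils.fs.put and reference that path under Init Scripts; a sketch, assuming a placeholder path (dbutils is only available inside Databricks notebooks):

# Hypothetical location for the init script; adjust to your workspace conventions
init_script_path = "dbfs:/databricks/init-scripts/raise-ulimit.sh"

# Write the same two-line script shown above to DBFS
dbutils.fs.put(
    init_script_path,
    "#!/bin/bash\nulimit -n 65535\n",
    overwrite=True,
)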

You can also set related Spark configurations to ease resource pressure (note that these do not raise the OS file descriptor limit itself):

{
  "spark.driver.maxResultSize": "4g",
  "spark.worker.cleanup.enabled": "true",
  "spark.executor.extraJavaOptions": "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
}
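To confirm which values are actually in effect on a running cluster, you can read them back from a notebook:

# Print the current values of the configs above; "<not set>" is just a fallback default
for key in ("spark.driver.maxResultSize",
            "spark.worker.cleanup.enabled",
            "spark.executor.extraJavaOptions"):
    print(key, "=", spark.conf.get(key, "<not set>"))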

3. Optimize Data Formats and Reduce the Number of Open Files

🚀 Switch from CSV to Parquet or Delta to reduce file count and improve performance.

Bad (Too many small files in CSV format):

df.write.format("csv").save("s3://mybucket/data/")

Good (Optimized for fewer open files):

df.write.format("parquet").save("s3://mybucket/data/")

Best (Use Delta for better metadata handling):

df.write.format("delta").save("s3://mybucket/delta_table/")

4. Reduce the Number of Partitions in Spark Jobs

🔧 Too many partitions = too many open files.

Reduce the number of output files using coalesce() or repartition().

df = df.repartition(10)  # Fewer partitions → fewer open file handles (coalesce(10) avoids a full shuffle)

Reduce shuffle partitions for large jobs:

{
  "spark.sql.shuffle.partitions": "200"
}
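The same setting can be applied per session from a notebook; a minimal sketch, continuing with the df from above and using a hypothetical customer_id column and output path:

# Match the shuffle-partition setting shown above for this session
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Hypothetical aggregation; coalesce before writing to cap the number of output files
result = df.groupBy("customer_id").count()
(result.coalesce(10)
       .write
       .format("delta")
       .mode("overwrite")
       .save("s3://mybucket/aggregated/"))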

5. Enable Auto-Compaction for Delta Tables

Delta tables can accumulate too many small files, causing excessive open files.

Enable Auto-Optimize for Delta tables:

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoOptimize.enabled", "true")

Manually optimize small files in Delta tables:

OPTIMIZE delta.`s3://mybucket/delta_table/` ZORDER BY (column_name);
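The same maintenance can be run from PySpark, and on Databricks you can also pin the behavior to a specific table with table properties instead of session settings; a sketch using the same placeholder path and column:

# Run compaction from PySpark (same effect as the SQL above)
spark.sql("""
  OPTIMIZE delta.`s3://mybucket/delta_table/`
  ZORDER BY (column_name)
""")

# Enable optimized writes and auto compaction on the table itself
spark.sql("""
  ALTER TABLE delta.`s3://mybucket/delta_table/`
  SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")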

6. Use Broadcast Joins to Reduce Shuffle File Overhead

🚀 If joining large tables, enable broadcast joins to prevent excessive shuffle file creation.

Adjust the auto broadcast join threshold (10MB is the Spark default; raise it to broadcast larger lookup tables):

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")

Manually broadcast smaller DataFrames:

from pyspark.sql.functions import broadcast

df_small = broadcast(df_small)
df_large.join(df_small, "id")
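To confirm the broadcast actually took effect, inspect the physical plan; it should show a BroadcastHashJoin rather than a SortMergeJoin:

# The physical plan should contain BroadcastHashJoin when the hint is honored
df_large.join(broadcast(df_small), "id").explain()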

7. Monitor Open File Handles and Cluster Logs

Check active Spark jobs for excessive open files:

  • Go to the Spark UI → Executors tab
  • Look for executors with unusually heavy shuffle or storage activity (the UI does not show file descriptor counts directly; check those at the OS level as in step 1)

Pass extra JVM options to executors if you want to tune file-handling behavior (note that this does not by itself log file descriptor counts):

{
  "spark.executor.extraJavaOptions": "-Dsun.nio.fs.fileAttributesCacheMaxSize=100000"
}

Check logs for file descriptor warnings:

grep "Too many open files" /var/log/spark/spark-*.log
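To log descriptor counts periodically (as recommended in the best practices below), a minimal driver-side sketch; it only covers the driver process, so executor-side checks would need a separate mechanism such as an init script:

import subprocess
import time

def log_open_fds(interval_seconds=60, iterations=5):
    """Print the driver process's open file descriptor count at a fixed interval."""
    for _ in range(iterations):
        count = int(subprocess.check_output("ls /proc/self/fd | wc -l", shell=True).decode().strip())
        print(f"Driver open file descriptors: {count}")
        time.sleep(interval_seconds)

log_open_fds(interval_seconds=30, iterations=3)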

Best Practices to Prevent SPARK005 (Too Many Open Files)

Optimize File Formats

  • Use Parquet or Delta Lake instead of CSV.
  • Enable Auto-Compaction for Delta tables.

Manage File Descriptor Limits

  • Increase ulimit -n to 65535.
  • Ensure Spark cluster has enough resources.

Reduce Partitions and Open Files

  • Use repartition() or coalesce() to limit the number of output files.
  • Reduce shuffle partitions (spark.sql.shuffle.partitions=200).

Monitor and Debug File Usage

  • Monitor the Spark UI (Executors and Storage tabs).
  • Log file descriptor counts periodically.

Conclusion

SPARK005 – Too Many Open Files occurs when Databricks workloads exceed file descriptor limits, causing job failures and slowdowns.
By optimizing file formats, reducing small files, tuning partitions, and increasing ulimit settings, you can prevent file descriptor exhaustion and improve Databricks job performance.
