Optimal file sizes, compaction strategies, and how to keep your Delta tables lightning-fast

Why Small Files Are a Big Problem
Delta Lake is powerful, but it inherits one common “data lake disease”: too many small files.
- What are small files? Parquet files much smaller than the recommended size (typically < 128 MB).
- Why do they happen?
- Streaming jobs writing micro-batches every few seconds.
- Multiple concurrent writers appending tiny chunks.
- Over-partitioning tables (e.g., daily + region + customer).
- Why are they bad?
- Spark must open every file → High metadata overhead.
- Scans become slow → Even queries on small data sets hit thousands of files.
- Cluster costs skyrocket → More compute wasted on file management instead of actual processing.
Think of it like this: instead of carrying a few big boxes, Spark ends up juggling thousands of tiny envelopes.
The Target: Optimal File Size
Delta (and underlying Parquet) perform best when files are 128 MB – 1 GB each.
- < 64 MB → too many files, high metadata overhead.
- > 1 GB → files become hard to split, limiting scan parallelism.
👉 Sweet Spot: 256 MB – 512 MB for balanced throughput and flexibility.
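A quick back-of-the-envelope check shows why the target matters: the file count for a given table size is just the size divided by the average file size. A plain-Python sketch (the helper name is illustrative):

```python
import math

TB = 1024 ** 4
MB = 1024 ** 2

def estimated_file_count(table_bytes: int, target_file_bytes: int) -> int:
    """Rough file count if data packs evenly at the target file size."""
    return math.ceil(table_bytes / target_file_bytes)

# A 1 TB table at different average file sizes:
print(estimated_file_count(1 * TB, 1 * MB))    # 1048576 tiny files
print(estimated_file_count(1 * TB, 256 * MB))  # 4096 well-sized files
```

Same data, three orders of magnitude fewer files to open, list, and plan.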
How to Fix It: Compaction Strategies
1. Bin-Packing with OPTIMIZE
Databricks provides the OPTIMIZE command to compact small files into larger ones.
```sql
-- Compact files in a Delta table
OPTIMIZE sales_data
WHERE date >= current_date() - INTERVAL 30 DAYS;
```
✅ Use WHERE to limit compaction to recent/hot partitions (saves cost).
✅ Run as a scheduled job (nightly/weekly depending on data load).
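A minimal sketch of how a scheduled job might assemble that statement, assuming the table has a `date` column as in the example above (the helper name is hypothetical):

```python
from datetime import date, timedelta

def build_optimize_sql(table: str, days_back: int) -> str:
    """Build an OPTIMIZE statement scoped to recent partitions.
    Assumes the table has a `date` column, as in the example above."""
    cutoff = date.today() - timedelta(days=days_back)
    return f"OPTIMIZE {table} WHERE date >= '{cutoff.isoformat()}'"

sql = build_optimize_sql("sales_data", 30)
# In a Databricks job you would submit it with: spark.sql(sql)
```

Scheduling it nightly in a Databricks Workflow keeps hot partitions compacted without ever re-optimizing cold history.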
2. Z-ORDER for Query Acceleration
Compaction alone solves small files, but Z-ORDER improves query performance by clustering data.
-- Compact AND cluster by frequently filtered columns
OPTIMIZE sales_data
ZORDER BY (customer_id, region);
- When to use: columns frequently used in `WHERE` filters or `JOIN` conditions.
- When NOT to use: columns that rarely appear in filters (e.g., a surrogate `transaction_id` used only for single-row lookups); clustering by them adds compaction cost with little benefit.
👉 Example:
- Queries filtering by `region` and `customer_id` will scan fewer files.
- Dashboards get faster response times with fewer shuffle operations.
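The reason clustered files scan faster is Delta's per-file min/max statistics: if a filter value falls outside a file's [min, max] range, the file is skipped entirely, and Z-ORDER makes those ranges tight. A toy simulation of that data-skipping logic (hypothetical stats, not the actual Delta API):

```python
def files_to_scan(file_stats, value):
    """file_stats: list of (min_id, max_id) per file; return indices of files
    whose range could contain `value` (min/max data skipping)."""
    return [i for i, (lo, hi) in enumerate(file_stats) if lo <= value <= hi]

# After Z-ORDER: customer_id ranges are tight and disjoint per file
clustered = [(1, 100), (101, 200), (201, 300), (301, 400)]
# Before: every file holds a random slice of the whole ID range
unclustered = [(1, 400)] * 4

print(len(files_to_scan(clustered, 150)))    # 1 file scanned
print(len(files_to_scan(unclustered, 150)))  # all 4 files scanned
```

Same filter, same data volume; clustering alone cuts the files touched by 4x in this toy case.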
3. Streaming Job Tuning
If you’re writing streaming data:
- Larger trigger intervals: use `.trigger(processingTime=...)` with longer intervals so each micro-batch accumulates more data and writes bigger files.
- Auto Loader batch sizing: tune micro-batch size (e.g., `cloudFiles.maxFilesPerTrigger`) to process bigger batches and reduce file counts.
- File notification mode: lets Auto Loader discover new files via storage events instead of directory listings, keeping ingestion efficient even when upstream writes are small and frequent.
Example in PySpark:
```python
df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/chk/sales") \
    .trigger(processingTime="5 minutes") \
    .start("/delta/sales")
```
👉 Instead of writing every 5 seconds, write every 5 minutes = fewer, bigger files.
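The arithmetic behind that advice is simple: files written per day scale inversely with the trigger interval. A plain-Python estimate, assuming each micro-batch emits one file (the function name is illustrative):

```python
SECONDS_PER_DAY = 86_400

def files_per_day(trigger_seconds: int, files_per_batch: int = 1) -> int:
    """Files written per day if every micro-batch emits `files_per_batch` files."""
    return (SECONDS_PER_DAY // trigger_seconds) * files_per_batch

print(files_per_day(5))    # 17280 files/day with a 5-second trigger
print(files_per_day(300))  # 288 files/day with a 5-minute trigger
```

A 60x longer trigger means 60x fewer (and 60x larger) files for the same daily volume.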
4. Partitioning Hygiene
Over-partitioning = tiny files per partition.
- Avoid: `PARTITION BY (year, month, day, region, country, city)`
- Prefer: `PARTITION BY (year, month)` or `(region)`, depending on query patterns.
👉 Rule of thumb: Don’t go beyond 2–3 partition columns unless absolutely necessary.
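The explosion is multiplicative: the worst-case number of partition directories is the product of the column cardinalities, and each directory gets its own (often tiny) files. A quick sketch with hypothetical cardinalities:

```python
import math

def partition_count(*cardinalities: int) -> int:
    """Worst-case partition directories = product of column cardinalities."""
    return math.prod(cardinalities)

# Hypothetical cardinalities: 5 years, 12 months, 31 days, 10 regions, 50 countries, 500 cities
print(partition_count(5, 12, 31, 10, 50, 500))  # 465000000 potential partitions
print(partition_count(5, 12))                   # 60 partitions for (year, month)
```

Dropping from six partition columns to two turns hundreds of millions of potential directories into sixty.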
Compaction Schedule: A Practical Playbook
| Table Type | Compaction Frequency | OPTIMIZE Strategy | Notes |
|---|---|---|---|
| Hot tables (daily dashboards, streaming inputs) | Daily | OPTIMIZE last 7–30 days, Z-ORDER by query columns | Keep dashboards fast |
| Warm tables (monthly reporting, moderate use) | Weekly | OPTIMIZE last 90 days | Balance cost & performance |
| Cold tables (rarely queried, historical archive) | Monthly or On-Demand | OPTIMIZE full table | Only when needed |
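The playbook above can be encoded as a small policy table that a scheduler consults before kicking off compaction; this is a hypothetical structure, not a Databricks feature:

```python
# Hypothetical per-tier policy mirroring the playbook table above.
COMPACTION_POLICY = {
    "hot":  {"frequency": "daily",   "scope": "last 7-30 days", "zorder": True},
    "warm": {"frequency": "weekly",  "scope": "last 90 days",   "zorder": False},
    "cold": {"frequency": "monthly", "scope": "full table",     "zorder": False},
}

def policy_for(table_type: str) -> dict:
    """Look up the compaction policy for a table classified as hot/warm/cold."""
    return COMPACTION_POLICY[table_type]

print(policy_for("hot")["frequency"])  # daily
```

Keeping the policy as data rather than hard-coding it per job makes it easy to re-tier a table as its usage changes.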
Example: From Pain to Performance
Imagine you have a sales_data table:
- Before: 1 TB of data → 1.2 million files (~1 MB each). Query time = 8 minutes.
- After OPTIMIZE + Z-ORDER: 1 TB → 4,000 files (~256 MB each). Query time = 40 seconds.
That’s the power of compaction + clustering.
Best Practices for Admins
- Monitor file size distribution: `DESCRIBE DETAIL <table>` reports `numFiles` and `sizeInBytes`; track these over time.
- Automate compaction: schedule `OPTIMIZE` with ADF, Airflow, or Databricks Workflows.
- Set guardrails: use cluster policies to prevent users from creating tiny file dumps.
- Balance cost vs. speed: compaction itself costs compute, so apply it where business value is highest.
Final Thoughts
Taming small files in Delta Lake isn’t optional — it’s critical for both performance and cost efficiency.
- Compact regularly with OPTIMIZE.
- Use Z-ORDER to align data with query patterns.
- Avoid over-partitioning.
- Automate compaction schedules based on table usage.
By applying these practices, you’ll turn your Delta Lake into a fast, lean, query-friendly data powerhouse.
✨ Next Steps for You:
- Audit one Delta table today for small files.
- Run `OPTIMIZE ... ZORDER BY (...)` on your most queried columns.
- Set up a weekly compaction job to keep things healthy.