Mohammad Gufran Jahangir · October 1, 2025

Optimal file sizes, compaction strategies, and how to keep your Delta tables lightning-fast


Why Small Files Are a Big Problem

Delta Lake is powerful, but it inherits one common “data lake disease”: too many small files.

  • What are small files?
    Parquet files that are much smaller than the recommended size (typically < 128 MB).
  • Why do they happen?
    • Streaming jobs writing micro-batches every few seconds.
    • Multiple concurrent writers appending tiny chunks.
    • Over-partitioning tables (e.g., daily + region + customer).
  • Why are they bad?
    • Spark must open every file → High metadata overhead.
    • Scans become slow → Even queries on small data sets hit thousands of files.
    • Cluster costs skyrocket → More compute wasted on file management instead of actual processing.

Think of it like this: instead of carrying a few big boxes, Spark ends up juggling thousands of tiny envelopes.
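The overhead is easy to quantify with a back-of-envelope calculation (plain Python, no Spark needed) showing how the file count explodes as the average file size shrinks:

```python
import math

def file_count(total_bytes, avg_file_bytes):
    """Number of files needed to hold total_bytes at a given average file size."""
    return math.ceil(total_bytes / avg_file_bytes)

TB = 1024 ** 4
MB = 1024 ** 2

# 1 TB written as ~1 MB micro-batch files vs. ~256 MB compacted files
tiny = file_count(1 * TB, 1 * MB)         # over a million files to open and track
compacted = file_count(1 * TB, 256 * MB)  # a few thousand files

print(tiny, compacted)  # 1048576 4096
```

Every one of those million-plus files costs an open call, a footer read, and a metadata entry before Spark reads a single row.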


The Target: Optimal File Size

Delta (and underlying Parquet) perform best when files are 128 MB – 1 GB each.

  • < 64 MB → Too many files, high metadata overhead.
  • > 1 GB → Files become hard to split and parallelize, hurting scan efficiency.

👉 Sweet Spot: 256 MB – 512 MB for balanced throughput and flexibility.


How to Fix It: Compaction Strategies

1. Bin-Packing with OPTIMIZE

Databricks provides the OPTIMIZE command to compact small files into larger ones.

-- Compact files in a Delta table
OPTIMIZE sales_data
WHERE date >= current_date() - INTERVAL 30 DAYS;

✅ Use WHERE to limit compaction to recent/hot partitions (saves cost).
✅ Run as a scheduled job (nightly/weekly depending on data load).
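One way to operationalize both tips is to assemble the OPTIMIZE statement programmatically and submit it from a scheduled notebook or job. A minimal sketch (the `build_optimize_sql` helper and its parameters are illustrative; the `spark.sql` call assumes a Databricks or Spark session):

```python
def build_optimize_sql(table, hot_days=None, date_col="date"):
    """Assemble an OPTIMIZE statement, optionally limited to recent partitions."""
    sql = f"OPTIMIZE {table}"
    if hot_days is not None:
        sql += f" WHERE {date_col} >= current_date() - INTERVAL {hot_days} DAYS"
    return sql

stmt = build_optimize_sql("sales_data", hot_days=30)
print(stmt)
# OPTIMIZE sales_data WHERE date >= current_date() - INTERVAL 30 DAYS

# In a scheduled job you would then submit it:
# spark.sql(stmt)
```

Keeping the statement in code makes it easy to vary the retention window per table from a single scheduler.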


2. Z-ORDER for Query Acceleration

Compaction alone fixes file counts; Z-ORDER goes further, improving query performance by clustering related data into the same files so queries can skip the rest.

-- Compact AND cluster by frequently filtered columns
OPTIMIZE sales_data
ZORDER BY (customer_id, region);

  • When to use: Columns often used in WHERE filters or JOIN conditions.
  • When NOT to use: High-cardinality columns (e.g., transaction_id) — no real benefit.

👉 Example:

  • Queries filtering by region and customer_id will scan fewer files.
  • Dashboards get faster response times with fewer shuffle operations.
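The cardinality rule above can be applied mechanically: screen candidate columns by their distinct-value ratio before Z-ORDERing. A hedged sketch (the 10% cutoff is an illustrative heuristic, not an official threshold; in practice the distinct counts would come from something like `approx_count_distinct`):

```python
def zorder_candidates(distinct_counts, row_count, max_ratio=0.1):
    """Keep columns whose distinct-value ratio is low enough to benefit
    from clustering; near-unique columns are filtered out."""
    return [col for col, n in distinct_counts.items()
            if n / row_count <= max_ratio]

stats = {
    "region": 12,                  # low cardinality: good clustering key
    "customer_id": 50_000,         # moderate: still useful
    "transaction_id": 1_000_000,   # ~unique per row: no real benefit
}
print(zorder_candidates(stats, row_count=1_000_000))
# ['region', 'customer_id']
```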

3. Streaming Job Tuning

If you’re writing streaming data:

  • Larger trigger intervals: Use trigger(processingTime="5 minutes") instead of a seconds-level cadence, so each micro-batch accumulates more data before writing.
  • Auto Loader batch sizing: Raise cloudFiles.maxBytesPerTrigger / cloudFiles.maxFilesPerTrigger so each micro-batch ingests and writes bigger chunks.
  • File notification mode: Lets Auto Loader discover new files without repeated directory listings, reducing overhead for high-frequency sources.

Example in PySpark:

df.writeStream \
  .format("delta") \
  .option("checkpointLocation", "/chk/sales") \
  .trigger(processingTime="5 minutes") \
  .start("/delta/sales")

👉 Instead of writing every 5 seconds, write every 5 minutes = fewer, bigger files.
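The effect of the trigger interval on file counts is easy to estimate. A sketch assuming each micro-batch writes roughly one file per output partition (a simplification — real counts depend on shuffle partitions and skew):

```python
def files_per_day(trigger_seconds, files_per_batch=1):
    """Approximate daily file count for a streaming writer."""
    batches = 86_400 // trigger_seconds  # micro-batches per day
    return batches * files_per_batch

print(files_per_day(5))    # 17280 files/day at a 5-second trigger
print(files_per_day(300))  # 288 files/day at a 5-minute trigger
```

Same data volume, two orders of magnitude fewer files.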


4. Partitioning Hygiene

Over-partitioning = tiny files per partition.

  • Avoid: PARTITION BY (year, month, day, region, country, city)
  • Prefer: PARTITION BY (year, month) or (region) depending on query patterns.

👉 Rule of thumb: Don’t go beyond 2–3 partition columns unless absolutely necessary.
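The rule of thumb can be sanity-checked with simple arithmetic: the number of partition directories is the product of each column's cardinality, and the average data per partition shrinks accordingly (cardinalities below are illustrative):

```python
from math import prod

def avg_partition_bytes(total_bytes, cardinalities):
    """Average data volume per partition directory, given each
    partition column's distinct-value count."""
    return total_bytes // prod(cardinalities)

TB, MB = 1024 ** 4, 1024 ** 2

# 1 TB table, illustrative cardinalities for each partition column
over = avg_partition_bytes(1 * TB, [3, 12, 30, 10, 20, 100])  # year..city: 21.6M dirs
sane = avg_partition_bytes(1 * TB, [3, 12])                   # year, month: 36 dirs

print(over // 1024, "KB per partition")  # ~49 KB: guaranteed tiny files
print(sane // MB, "MB per partition")    # tens of GB: room for healthy file sizes
```

No compaction job can save a table whose partitions are smaller than a single healthy file.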


Compaction Schedule: A Practical Playbook

  • Hot tables (daily dashboards, streaming inputs): compact daily. Strategy: OPTIMIZE the last 7–30 days, Z-ORDER by query columns. Goal: keep dashboards fast.
  • Warm tables (monthly reporting, moderate use): compact weekly. Strategy: OPTIMIZE the last 90 days. Goal: balance cost and performance.
  • Cold tables (rarely queried, historical archive): compact monthly or on demand. Strategy: OPTIMIZE the full table, and only when needed.
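This playbook can be encoded directly in a scheduler. A sketch (the tier names, `sales_data` table, and predicates mirror the examples in this post; wiring the returned SQL into spark.sql and a Workflow/Airflow schedule is left out):

```python
# Tier -> (how often to run, OPTIMIZE statement to submit)
PLAYBOOK = {
    "hot":  ("daily",
             "OPTIMIZE sales_data WHERE date >= current_date() - INTERVAL 30 DAYS"
             " ZORDER BY (customer_id, region)"),
    "warm": ("weekly",
             "OPTIMIZE sales_data WHERE date >= current_date() - INTERVAL 90 DAYS"),
    "cold": ("monthly", "OPTIMIZE sales_data"),
}

def compaction_job(tier):
    frequency, sql = PLAYBOOK[tier]
    # Schedule at `frequency` in your orchestrator, then run: spark.sql(sql)
    return frequency, sql

print(compaction_job("hot")[0])  # daily
```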

Example: From Pain to Performance

Imagine you have a sales_data table:

  • Before: 1 TB of data → 1.2 million files (~1 MB each). Query time = 8 minutes.
  • After OPTIMIZE + Z-ORDER: 1 TB → 4,000 files (~256 MB each). Query time = 40 seconds.

That’s the power of compaction + clustering.


Best Practices for Admins

  • Monitor file size distribution: DESCRIBE DETAIL reports numFiles and sizeInBytes per table; track these over time to spot small-file creep.
  • Automate compaction: Schedule OPTIMIZE with ADF, Airflow, or Databricks Workflows.
  • Set guardrails: Use cluster policies to prevent users from creating tiny file dumps.
  • Balance cost vs. speed: Compaction is expensive — apply it where business value is highest.
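For the monitoring bullet, per-file sizes (available from the table's transaction log; DESCRIBE DETAIL gives the table-level totals) can be rolled into a simple health check. A sketch over a plain list of sizes, with an illustrative 90/10 split of tiny vs. healthy files:

```python
def small_file_ratio(file_sizes_bytes, threshold=128 * 1024 * 1024):
    """Fraction of files below the recommended minimum size."""
    if not file_sizes_bytes:
        return 0.0
    small = sum(1 for s in file_sizes_bytes if s < threshold)
    return small / len(file_sizes_bytes)

sizes = [2 * 1024**2] * 900 + [300 * 1024**2] * 100  # mostly tiny files
ratio = small_file_ratio(sizes)
print(f"{ratio:.0%} of files are under 128 MB")  # 90% of files are under 128 MB
```

Alerting when the ratio crosses a chosen threshold turns compaction from a guess into a data-driven job.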

Final Thoughts

Taming small files in Delta Lake isn’t optional — it’s critical for both performance and cost efficiency.

  • Compact regularly with OPTIMIZE.
  • Use Z-ORDER to align data with query patterns.
  • Avoid over-partitioning.
  • Automate compaction schedules based on table usage.

By applying these practices, you’ll turn your Delta Lake into a fast, lean, query-friendly data powerhouse.


✨ Next Steps for You:

  1. Audit one Delta table today for small files.
  2. Run OPTIMIZE + ZORDER on your most queried columns.
  3. Set up a weekly compaction job to keep things healthy.
