Mohammad Gufran Jahangir · August 13, 2025

This is the short, practical version of the big guide. Three tools you’ll use again and again:

  • Liquid Clustering → keep related rows stored near each other (faster filters).
  • Deletion Vectors (DVs) → mark rows as deleted/updated without rewriting big files.
  • OPTIMIZE → clean up small files and keep layout healthy.

We’ll use one tiny table and show copy‑paste SQL for each feature.


Sample table (use this for all examples)

CREATE TABLE IF NOT EXISTS sales.orders_simple (
  order_id     BIGINT,
  order_ts     TIMESTAMP,
  customer_id  STRING,
  country      STRING,
  amount       DECIMAL(18,2)
) USING DELTA;

1) Liquid Clustering — think “group similar rows together”

Why: Queries like WHERE customer_id = 'C123' AND DATE(order_ts) = '2025-08-13' only scan a few files instead of the whole table.

How to turn it on (you can add later too):

-- Note: CLUSTER BY takes plain column names; expressions like date_trunc() are not supported
ALTER TABLE sales.orders_simple
CLUSTER BY (customer_id, order_ts);

Keep it tidy: run OPTIMIZE on the recent data (cheap and effective):

OPTIMIZE sales.orders_simple
WHERE order_ts >= date_sub(current_timestamp(), 7);   -- last 7 days

When to use: You frequently filter by customer + day, tenant + date, device + date, etc. Pick 1–2 columns you filter on most.
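To see the benefit, here is the kind of lookup that clustering speeds up (the customer ID and date are illustrative):

-- Scans only files whose min/max stats overlap the filter, thanks to clustering
SELECT order_id, amount
FROM sales.orders_simple
WHERE customer_id = 'C123'
  AND order_ts >= '2025-08-13' AND order_ts < '2025-08-14';

Filtering on order_ts directly (rather than wrapping it in a function) keeps the predicate friendly to file skipping.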


2) Deletion Vectors — think “post‑it notes on rows you want to hide”

Why: DELETE/MERGE becomes much faster; Databricks marks rows instead of rewriting whole Parquet files.

Enable once per table:

ALTER TABLE sales.orders_simple
SET TBLPROPERTIES (delta.enableDeletionVectors = true);

Do a delete (fast):

DELETE FROM sales.orders_simple WHERE order_id = 42;

Rows are hidden immediately. Later, you can physically remove them during maintenance.
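Upserts get the same speedup. A minimal MERGE sketch (the updates source table is hypothetical):

MERGE INTO sales.orders_simple AS t
USING updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT *;

With DVs enabled, matched rows are marked rather than forcing full Parquet file rewrites.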

Purge the marks later (optional cleanup):

-- Either compact recent data (rewrites files and drops many DV marks)
OPTIMIZE sales.orders_simple
WHERE order_ts >= date_sub(current_timestamp(), 30);

-- Or, if available in your runtime, rewrite the files that carry DVs directly
-- REORG TABLE sales.orders_simple APPLY (PURGE);

Good to know: Time Travel and Change Data Feed still work with DVs.
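For example, you can still read the table as it was before the delete (version 0 here is illustrative; check DESCRIBE HISTORY for your real version numbers):

SELECT * FROM sales.orders_simple VERSION AS OF 0;

-- Change Data Feed also works, but requires delta.enableChangeDataFeed = true on the table
-- SELECT * FROM table_changes('sales.orders_simple', 0);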


3) OPTIMIZE — think “vacuum seal lots of crumbs into a few big bags”

Why: Many tiny files make reads slow. OPTIMIZE compacts them (target ~1GB each) and, if you set CLUSTER BY, it also re‑clusters data.

Basic:

OPTIMIZE sales.orders_simple;

Cheaper, smarter (only hot data):

OPTIMIZE sales.orders_simple
WHERE order_ts >= current_date() - INTERVAL 14 DAYS;

Legacy (non‑Liquid tables only):

-- If you didn’t enable Liquid Clustering and still want multi‑col pruning
OPTIMIZE sales.orders_simple ZORDER BY (customer_id, order_ts);

4) Put it all together (2‑minute routine)

  1. Declare clustering on the columns you filter by most:
ALTER TABLE sales.orders_simple CLUSTER BY (customer_id, order_ts);
  2. Enable deletion vectors so upserts/deletes are fast:
ALTER TABLE sales.orders_simple SET TBLPROPERTIES (delta.enableDeletionVectors = true);
  3. Run nightly/weekly maintenance on the recent window:
OPTIMIZE sales.orders_simple WHERE order_ts >= date_sub(current_timestamp(), 7);

Quick Q&A

Q: How do I pick cluster columns?
A: Choose 1–2 columns that appear most often in your WHERE clauses. Avoid booleans or low‑cardinality flags.

Q: Will clustering or DVs break sharing?
A: Some external readers may not support DVs. If sharing outside Databricks, run an OPTIMIZE or DV purge first.

Q: How do I know it worked?
A: DESCRIBE HISTORY sales.orders_simple; → check operation = OPTIMIZE and metrics. Also, query profiles will show files pruned increasing.
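You can also confirm the table settings themselves (exact field names may vary by runtime):

DESCRIBE DETAIL sales.orders_simple;      -- look for clusteringColumns
SHOW TBLPROPERTIES sales.orders_simple;   -- look for delta.enableDeletionVectors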


Tiny playground (copy–paste order)

-- 1) Create table
CREATE TABLE IF NOT EXISTS sales.orders_simple (
  order_id BIGINT, order_ts TIMESTAMP, customer_id STRING, country STRING, amount DECIMAL(18,2)
) USING DELTA;

-- 2) Add clustering and DVs
ALTER TABLE sales.orders_simple CLUSTER BY (customer_id, order_ts);
ALTER TABLE sales.orders_simple SET TBLPROPERTIES (delta.enableDeletionVectors = true);

-- 3) Ingest some sample data (toy example)
INSERT INTO sales.orders_simple VALUES
  (1, current_timestamp(), 'C1', 'US', 10.00),
  (2, current_timestamp(), 'C1', 'US', 12.50),
  (3, current_timestamp(), 'C2', 'CA', 9.99);

-- 4) Delete one row (fast via DV)
DELETE FROM sales.orders_simple WHERE order_id = 2;

-- 5) Compact & cluster recent data
OPTIMIZE sales.orders_simple WHERE order_ts >= date_sub(current_timestamp(), 7);

One‑page cheat‑sheet

  • CLUSTER BY groups related rows → faster filters.
  • DVs on for fast DELETE/MERGE → purge later with OPTIMIZE/REORG.
  • OPTIMIZE regularly on recent data → fewer small files + keeps clustering fresh.
  • Check history and query profile to confirm improvements.

That’s it. Three switches, a small nightly OPTIMIZE, and your Delta tables stay fast and tidy.
