Mohammad Gufran Jahangir, August 30, 2025


What is Databricks SQL (DBSQL)?

Databricks SQL is the lakehouse analytics experience for running SQL queries and building dashboards, alerts, and scheduled jobs on Delta tables. You connect through a SQL warehouse, a managed compute endpoint for SQL workloads. (Databricks Documentation)

SQL Warehouses, in one minute

  • A SQL warehouse is the compute you attach to in DBSQL clients; it auto-scales and caches for BI/SQL.
  • Choose the Serverless type when available: lower operational overhead and fast warm starts. (Databricks Documentation)

Create your first (Serverless) SQL Warehouse

  1. Admin enables/uses Serverless SQL warehouses in workspace settings.
  2. SQL → Warehouses → Create → Type: Serverless → choose size, min/max clusters, auto-stop → Grant “Can use” to groups.
  3. Test:
SELECT current_user(), current_catalog(), current_schema();

(“What is a SQL warehouse?” and enablement details live in the docs.) (Databricks Documentation, Microsoft Learn)


Model the warehouse on the lake (Delta)

A typical layout is catalog.schema.table, with star schemas in curated zones:

-- Catalog/schema for the warehouse
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA  IF NOT EXISTS prod.dw COMMENT 'Curated (gold) star schema';

-- Dimensions
CREATE TABLE IF NOT EXISTS prod.dw.dim_date   (date_key DATE PRIMARY KEY, ...);
CREATE TABLE IF NOT EXISTS prod.dw.dim_product(product_id BIGINT PRIMARY KEY, ...);

-- Fact
CREATE TABLE IF NOT EXISTS prod.dw.fact_sales(
  date_key DATE, product_id BIGINT, qty INT, amount DECIMAL(18,2), ...
);

Use Unity Catalog for permissions; keep Bronze/Silver/Gold separation but expose Gold to analysts.


Performance cheats that matter

1) Layout optimization (files)

Use OPTIMIZE to compact small files and (optionally) co-locate by columns. (If you’ve enabled liquid clustering on a table, use that; Z-order isn’t compatible with liquid clustering.) (Databricks Documentation)

-- Bin-pack files (default)
OPTIMIZE prod.dw.fact_sales;

-- Co-locate by a filter column (if not using liquid clustering)
OPTIMIZE prod.dw.fact_sales ZORDER BY (date_key);
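If you have adopted liquid clustering instead, the equivalent (a sketch, using the example table and column from the schema above) replaces Z-order entirely:

-- Alternative: liquid clustering on the filter column
ALTER TABLE prod.dw.fact_sales CLUSTER BY (date_key);
OPTIMIZE prod.dw.fact_sales;  -- clusters incrementally; no ZORDER BY clause

Once a table is liquid-clustered, plain OPTIMIZE handles both compaction and clustering, so you can drop ZORDER BY from maintenance jobs.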

2) Materialized Views for hot aggregates

Materialized views precompute results on a schedule (or incrementally) to lower latency and cost for dashboards. (Databricks Documentation)

CREATE MATERIALIZED VIEW IF NOT EXISTS prod.dw.mv_daily_sales
SCHEDULE EVERY 1 HOUR
AS SELECT date_key, SUM(amount) AS revenue
   FROM prod.dw.fact_sales
   GROUP BY date_key;

3) Caching & engine

Serverless SQL warehouses include the performance features of DBSQL (Photon, result cache, etc.) and are a great default for BI. (Databricks Documentation)


End-to-end loading pattern (incremental)

You can land raw files with Auto Loader (or CDC), curate them in Silver with DLT/Jobs, then surface Gold to DBSQL:
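For the landing step, a minimal Auto Loader sketch in DBSQL might look like the following (the catalog/schema names and the volume path are illustrative, not from the original pipeline):

-- Bronze: ingest raw files incrementally with Auto Loader (path is a placeholder)
CREATE OR REFRESH STREAMING TABLE prod.bronze.sales_raw
AS SELECT * FROM STREAM read_files(
  '/Volumes/prod/landing/sales/',
  format => 'json'
);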

-- Gold build job (SQL task) that runs after Silver refresh
INSERT INTO prod.dw.fact_sales
SELECT d.date_key, p.product_id, f.qty, f.amount
FROM prod.silver.sales f
JOIN prod.dw.dim_date d     ON f.sale_date = d.date_key
JOIN prod.dw.dim_product p  ON f.product_code = p.product_code;

Schedule this SQL in a Job so your SQL warehouse always sees fresh Gold.
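If the Silver source can replay rows, a plain INSERT will duplicate facts on re-runs. An idempotent MERGE sketch (assuming date_key plus product_id is the natural key, which the example schema doesn't state explicitly) avoids that:

-- Idempotent Gold build: upsert instead of append
MERGE INTO prod.dw.fact_sales t
USING (
  SELECT d.date_key, p.product_id, f.qty, f.amount
  FROM prod.silver.sales f
  JOIN prod.dw.dim_date d    ON f.sale_date = d.date_key
  JOIN prod.dw.dim_product p ON f.product_code = p.product_code
) s
ON t.date_key = s.date_key AND t.product_id = s.product_id
WHEN MATCHED THEN UPDATE SET t.qty = s.qty, t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (date_key, product_id, qty, amount)
  VALUES (s.date_key, s.product_id, s.qty, s.amount);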


Governance & security (Unity Catalog)

Keep it simple and least-privilege:

-- Let analysts browse and read Gold
GRANT USE CATALOG ON CATALOG prod TO `data_analysts`;
GRANT USE SCHEMA  ON SCHEMA  prod.dw TO `data_analysts`;
GRANT SELECT ON SCHEMA prod.dw TO `data_analysts`;  -- UC inheritance: covers current and future tables

Add row filters and column masks for PII when needed; manage storage via UC external locations/volumes.
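A minimal row-filter/column-mask sketch, assuming fact_sales has a region column (elided as "..." in the DDL above) and an admins group exists:

-- Row filter: non-admins see only one region (predicate is illustrative)
CREATE OR REPLACE FUNCTION prod.dw.us_only(region STRING)
RETURN is_account_group_member('admins') OR region = 'US';
ALTER TABLE prod.dw.fact_sales SET ROW FILTER prod.dw.us_only ON (region);

-- Column mask: hide amounts from non-admins
CREATE OR REPLACE FUNCTION prod.dw.mask_amount(amount DECIMAL(18,2))
RETURN CASE WHEN is_account_group_member('admins') THEN amount ELSE NULL END;
ALTER TABLE prod.dw.fact_sales ALTER COLUMN amount SET MASK prod.dw.mask_amount;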


Query-tuning checklist (for analysts)

  1. Filter early (WHERE with selective predicates).
  2. Prefer equality/IN against dimension keys (helps data skipping).
  3. Avoid SELECT *; project needed columns only.
  4. Use materialized views for heavy joins/aggregates; schedule refresh. (Databricks Documentation)
  5. Keep tables compact with OPTIMIZE; consider Z-order (or liquid clustering if you’ve adopted it). (Databricks Documentation)

Operating your warehouse

  • Right-size: start smaller; enable autoscale; set an aggressive auto-stop for ad-hoc teams.
  • Workloads: give Analytics a dedicated warehouse; heavy ELT SQL tasks can run on a separate “job” warehouse.
  • Observability: use DBSQL query history, statement profile, and docs best practices for admin settings. (Databricks Documentation)

Copy-paste starter: end-user schema & MV

USE CATALOG prod;
USE SCHEMA dw;

-- Example query
SELECT p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
WHERE f.date_key >= date('2025-01-01')
GROUP BY p.category
ORDER BY revenue DESC;

-- Materialized view for a dashboard
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_revenue_by_category
SCHEDULE EVERY 15 MINUTES
AS SELECT p.category, SUM(f.amount) AS revenue
   FROM fact_sales f
   JOIN dim_product p ON f.product_id = p.product_id
   GROUP BY p.category;

Wrap-up

  • A SQL warehouse is your endpoint for BI/SQL—pick Serverless when available. (Databricks Documentation)
  • Model star schemas on Delta and expose Gold to analysts.
  • Use OPTIMIZE (and liquid clustering or Z-order where appropriate) and materialized views to hit sub-second dashboards. (Databricks Documentation)
