Mohammad Gufran Jahangir — August 30, 2025

Lakehouse Federation lets you query external databases directly from Databricks—without copying the data into your lake. You point Unity Catalog at a source (PostgreSQL, SQL Server, Redshift, Snowflake, BigQuery, another Databricks workspace, and more), and then read those tables with regular SQL, governed by UC. (Databricks Documentation, Databricks)


The two flavors: Query vs. Catalog federation

  • Query federation: set up a connection to an external database; UC creates a foreign catalog that mirrors it. You run read-only queries; Databricks pushes work down over JDBC where possible. (Microsoft Learn, Databricks Documentation)
  • Catalog federation: connect to an external catalog (for example, an external Hive Metastore, AWS Glue, or Snowflake Horizon); UC crawls and mirrors it as a foreign catalog so you can search, secure, and query those tables in one place. (Azure Documentation)

In Databricks UI you’ll see these mirrors as Foreign catalogs under Unity Catalog. They behave like normal catalogs for browsing and governance, but are read-only. (Databricks Documentation, Microsoft Learn)


When to use Federation (and when not)

Great for:

  • Ad-hoc or live reporting on operational DBs you don’t want to ingest.
  • Cross-source joins (careful with size) to prototype analytics without pipelines.
  • Centralized governance & lineage (Unity Catalog) across lake + external DBs. (Databricks Documentation)

Consider alternatives when:

  • You need high-volume, low-latency analytics or heavy transforms—use ingestion (DLT/Auto Loader/Lakeflow Connect) into Delta instead. Databricks explicitly recommends the ingestion route when performance at scale matters. (Databricks Documentation)

Architecture (1-minute mental model)

  1. You create a Connection (credentials + endpoint).
  2. You create a Foreign catalog that points at a database/catalog in the source.
  3. Users query with three-part names like ext.sales.customers.
  4. Databricks pushes predicates/projections down to the source when it can; results stream back to your SQL warehouse/cluster. (Databricks Documentation, Microsoft Learn)
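One quick way to sanity-check step 4 is to EXPLAIN a federated query and look for your filters in the scan node of the plan (the catalog, table, and column names below are illustrative):

-- Inspect the physical plan; predicates that were pushed down
-- to the source appear inside the JDBC/remote scan node.
EXPLAIN
SELECT customer_id, name
FROM ext.sales.customers
WHERE country = 'DE';

If the filter shows up only in a Spark-side Filter node rather than the scan, the connector couldn't push it down and every row is being pulled over the wire first.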

Setup: the minimal path (SQL)

You need: Unity Catalog privileges to create the connection and foreign catalog (CREATE CONNECTION on the metastore, plus CREATE CATALOG), and a Pro or Serverless SQL warehouse or a UC-enabled cluster to query from.

1) Create a connection

-- Example: PostgreSQL (similar flow for SQL Server, Redshift, Snowflake, BigQuery…)
CREATE CONNECTION pg_conn
  TYPE POSTGRESQL
  OPTIONS (
    host 'pg.company.internal',
    port '5432',
    user 'reporter',
    password '***'   -- or a secret reference, if supported in your environment
  );
-- Note: for PostgreSQL, the database is specified on the foreign catalog
-- (step 2 below), not on the connection itself.
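If your workspace supports secret references in connection options, you can avoid a plaintext password entirely (the scope pg_secrets and key reporter_pw below are hypothetical names):

-- Variant: reference a Databricks secret instead of a plaintext password
CREATE CONNECTION pg_conn
  TYPE POSTGRESQL
  OPTIONS (
    host 'pg.company.internal',
    port '5432',
    user 'reporter',
    password secret('pg_secrets', 'reporter_pw')
  );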

2) Create a foreign catalog (mirrors the external DB/catalog)

CREATE FOREIGN CATALOG ext_sales
  USING CONNECTION pg_conn
  OPTIONS (database 'sales');   -- mapping depends on the connector

3) Query it like any UC catalog

SELECT c.customer_id, o.order_id, o.amount
FROM ext_sales.public.orders o
JOIN ext_sales.public.customers c USING (customer_id)
WHERE o.order_date >= date('2025-01-01');

These objects are governed by UC; grant USE CATALOG/SCHEMA and SELECT to the right groups like you would for native UC catalogs. (Databricks Documentation)
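A minimal grant set for a reporting group might look like this (the group name analysts and the schema public are placeholders for your own):

-- Let the group browse the foreign catalog and read one schema
GRANT USE CATALOG ON CATALOG ext_sales TO `analysts`;
GRANT USE SCHEMA  ON SCHEMA  ext_sales.public TO `analysts`;
GRANT SELECT      ON SCHEMA  ext_sales.public TO `analysts`;

Granting SELECT at the schema level covers every table mirrored into it; grant on individual tables instead if you need tighter scoping.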

Want click-ops? Catalog Explorer can create the connection and foreign catalog for you. (Microsoft Learn)


Privileges & governance (Unity Catalog)

  • Treat connections and foreign catalogs as securables: you can grant permissions centrally (e.g., who can use a connection). (Microsoft Learn)
  • Grant normal data privileges (USE CATALOG, USE SCHEMA, SELECT) on the foreign catalog or its schemas/tables. (Databricks Documentation)
  • UC also captures lineage spanning foreign catalogs and native Delta tables. (Databricks Documentation)

Supported sources (examples)

Connectors exist for many platforms including PostgreSQL, MySQL, SQL Server, Redshift, Snowflake, BigQuery, Salesforce Data Cloud, and another Databricks workspace (Databricks-to-Databricks). Check the connector page for exact options and limits. (Databricks, Databricks Documentation, Microsoft Learn)


Performance & cost tips

  • Filter early: write selective predicates—pushdown reduces rows pulled over JDBC. (Databricks Documentation)
  • Avoid large cross-source shuffles: join small→large, and consider ingesting the large side into Delta for repeat analytics.
  • Mind egress: remote reads can cross network boundaries; keep sources region-close to your warehouse.
  • Security first: prefer managed identities/secret scopes for credentials; use UC grants instead of app-level users.
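The first two tips above boil down to: name your columns and filter on something selective, so the connector can prune at the source (table and column names assumed from the earlier example):

-- Pushdown-friendly: explicit columns + selective predicate,
-- so only matching rows/columns cross the JDBC link
SELECT order_id, amount
FROM ext_sales.public.orders
WHERE order_date >= date('2025-06-01');

-- Avoid: SELECT * with no predicate pulls every column of every row
-- over the network before Spark does any filtering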

Federation vs. ingestion (quick matrix)

| Use case | Federation | Ingest into Delta |
|---|---|---|
| Quick POC / one-off analysis | ✅ No pipelines, live data | |
| Operational DB reporting, light filters | ✅ Works well with pushdown | ⚠ Adds ETL |
| Heavy joins/aggregates at scale | ⚠ Can be slow/expensive | ✅ Best |
| Governance & lineage | ✅ UC governs foreign catalogs | ✅ UC native |
| Cost predictability | ⚠ Depends on remote DB & egress | ✅ Under your control |

When both Lakehouse Federation and Lakeflow Connect are possible, Databricks suggests Lakeflow Connect/ingestion for higher-volume, lower-latency needs. (Databricks Documentation)


A tiny end-to-end example

Goal: Keep finance in PostgreSQL, modelers in Databricks.

  1. Create pg_conn and ext_fin foreign catalog (as above).
  2. Analysts query:

     SELECT a.account_id, SUM(t.amount) AS balance
     FROM ext_fin.public.transactions t
     JOIN ext_fin.public.accounts a USING (account_id)
     WHERE t.posted_at >= date('2025-01-01')
     GROUP BY a.account_id;
  3. If that query gets hot and heavy, ingest transactions into Delta (DLT/Jobs), then join locally for performance. Keep ext_fin.accounts federated if it’s small and needs to stay live.
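Step 3 can start as a simple one-shot copy before you invest in a pipeline; a minimal sketch, assuming a native catalog main.finance exists for the Delta copy:

-- Hypothetical one-shot ingest of the hot table into Delta
CREATE OR REPLACE TABLE main.finance.transactions AS
SELECT * FROM ext_fin.public.transactions;

-- Then join the local Delta copy against the still-federated,
-- still-live accounts table
SELECT a.account_id, SUM(t.amount) AS balance
FROM main.finance.transactions t
JOIN ext_fin.public.accounts a USING (account_id)
GROUP BY a.account_id;

For ongoing freshness you'd replace the CTAS with a scheduled job or DLT pipeline, but the query shape stays the same.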

Troubleshooting (fast)

  • “Object not found” → ensure you used 3-part names (catalog.schema.table) from the foreign catalog and have USE/SELECT grants. (Databricks Documentation)
  • “Slow query” → add selective predicates, limit columns, or ingest the large side.
  • “Auth/connection errors” → check the Connection object details/permissions and the source’s firewall/SSL settings. (Microsoft Learn)
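For the first and third cases, a few inspection commands narrow things down quickly (catalog and connection names from the setup example above):

-- What connections can I see, and what does mine point at?
SHOW CONNECTIONS;
DESCRIBE CONNECTION pg_conn;   -- endpoint details, no secrets shown

-- Did the foreign catalog resolve? List what it mirrored.
SHOW SCHEMAS IN ext_sales;
SHOW TABLES IN ext_sales.public;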

Wrap-up

Federation gives you zero-ETL access to external systems, governed in one place with Unity Catalog. Start by creating a Connection and a Foreign catalog, grant least-privilege access, and use pushdown-friendly SQL. Move to ingestion when workloads get big or SLAs get tight. (Databricks Documentation)
