What is a streaming table in Databricks?
A streaming table is a Delta table that Databricks keeps up to date automatically as new data arrives. Under the hood it uses Structured Streaming (micro-batches plus checkpoints), so writes are fault-tolerant and exactly-once.
Think of it as: “a table that continuously ingests from a source (files, Kafka, etc.) without you running manual batch jobs.”
Why use it?
- New files/records are picked up incrementally
- Crash/retry safe (uses a checkpoint)
- Plays nicely with the lakehouse: downstream tables can read it incrementally
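The crash/retry safety comes from the checkpoint: the stream records which inputs it has already committed, so a restart resumes instead of reprocessing. A toy pure-Python sketch of that idea (illustrative only, not Databricks code; the file names and in-memory `checkpoint` set are hypothetical stand-ins for the real checkpoint folder):

```python
# Toy model of checkpointed, incremental file ingestion.
# Databricks persists this state in the checkpoint folder; here it's an in-memory set.

def ingest_incrementally(all_files, checkpoint, table):
    """Process only files not yet recorded in the checkpoint (idempotent on retry)."""
    for path in sorted(all_files):
        if path in checkpoint:             # already committed -> skip (no duplicates)
            continue
        table.append(f"rows from {path}")  # "write" the new data
        checkpoint.add(path)               # record progress alongside the write

checkpoint, table = set(), []
ingest_incrementally(["a.json", "b.json"], checkpoint, table)
# Retrying with the same files plus one new one only processes the new file:
ingest_incrementally(["a.json", "b.json", "c.json"], checkpoint, table)
print(table)  # -> ['rows from a.json', 'rows from b.json', 'rows from c.json']
```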
Two common ways to create one
1) With Python (Jobs/Notebooks)
from pyspark.sql.functions import current_timestamp
SRC = "/Volumes/dev/landing/events" # or s3://, abfss://, gs://
CHKPT = "/Volumes/dev/ops/_chk/events_stream" # checkpoint path
df = (spark.readStream
      .format("cloudFiles")                           # Auto Loader
      .option("cloudFiles.format", "json")
      .option("cloudFiles.inferColumnTypes", "true")
      .load(SRC)
      .withColumn("ingestion_ts", current_timestamp()))

(df.writeStream
   .option("checkpointLocation", CHKPT)
   .toTable("dev.bronze.events"))                     # <-- streaming table (Delta)
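If you don't need a continuously running cluster, the same pipeline can run as a scheduled incremental batch by adding a trigger. A sketch of that variant (assumes the same `df`/`CHKPT` from above and Spark 3.3+, where `availableNow` is supported):

```python
# Same Auto Loader read as above, but the write processes everything currently
# available and then stops -- suitable for a scheduled job instead of 24/7 streaming.
(df.writeStream
   .option("checkpointLocation", CHKPT)
   .trigger(availableNow=True)    # catch up on all new data, then shut down
   .toTable("dev.bronze.events"))
```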
2) With DLT (Delta Live Tables) — SQL
CREATE STREAMING LIVE TABLE bronze_events
AS SELECT *, current_timestamp() AS ingestion_ts
FROM cloud_files(
  '/Volumes/dev/landing/events',  -- source folder
  'json',
  map('inferColumnTypes', 'true', 'schemaLocation', '/Volumes/dev/ops/_schemas/events')
);
In DLT you read it downstream with STREAM(LIVE.bronze_events) for incremental processing.
How do I use/consume it?
- From Spark/PySpark: spark.readStream.table("dev.bronze.events") for further streaming transforms, or spark.read.table(...) for point-in-time batch reads.
- From SQL/BI: it’s just a Delta table; query it like normal. (Under the hood, another job keeps it fresh.)
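For example, a downstream Silver job can keep transforming incrementally. A sketch assuming a running Spark session; the dev.silver.events name, filter column, and checkpoint path are hypothetical:

```python
# Incremental read of the streaming table, light transform, write to Silver.
events = spark.readStream.table("dev.bronze.events")
(events.filter("event_type IS NOT NULL")
       .writeStream
       .option("checkpointLocation", "/Volumes/dev/ops/_chk/silver_events")
       .toTable("dev.silver.events"))
```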
Key concepts (in one line each)
- Checkpoint: folder where the stream’s progress is saved (required).
- Exactly-once: committed data won’t be duplicated after retries.
- Auto Loader: easiest way to stream files; handles discovery & schema drift.
- Schema evolution: allow new columns via options (e.g., addNewColumns) or park extras in a rescued column.
- Dedup/windows (optional): use watermarks + dropDuplicates, or apply MERGE logic downstream in Silver.
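The watermark-plus-dedup idea in the last bullet can be sketched without Spark: remember IDs only while they are newer than the watermark, and drop repeats inside that window. A toy model (real watermarking is per-key state managed by Structured Streaming; `dedup_with_watermark` and its inputs are hypothetical):

```python
# Toy watermark + dropDuplicates: state expires once it falls behind
# (max event time seen - delay), which is what bounds memory in a real stream.

def dedup_with_watermark(events, delay):
    """events: (event_id, event_time) pairs; returns the rows kept."""
    seen, max_time, kept = {}, 0, []
    for event_id, t in events:
        max_time = max(max_time, t)
        watermark = max_time - delay
        # expire state older than the watermark
        seen = {k: v for k, v in seen.items() if v >= watermark}
        if event_id in seen:
            continue                  # duplicate within the window -> dropped
        seen[event_id] = t
        kept.append((event_id, t))
    return kept

rows = [("a", 1), ("a", 2), ("b", 3), ("a", 30)]  # second "a" is a dup; third arrives after expiry
print(dedup_with_watermark(rows, delay=10))       # -> [('a', 1), ('b', 3), ('a', 30)]
```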
When to use a streaming table vs. other options
- Streaming table: you want continuous ingestion with reliability (files/Kafka → Bronze).
- Materialized view: you want fast reads of a heavy aggregate refreshed on a schedule.
- Regular table: you load in batches (ad-hoc or scheduled copy).