Mohammad Gufran Jahangir, August 30, 2025

What is a streaming table in Databricks?

A streaming table is a Delta table that Databricks keeps up to date automatically as new data arrives. It uses Structured Streaming under the hood (micro-batches + checkpoints), so writes are fault-tolerant and exactly-once.

Think of it as: “a table that continuously ingests from a source (files, Kafka, etc.) without you running manual batch jobs.”


Why use it?

  • New files/records are picked up incrementally
  • Crash/retry safe (uses a checkpoint)
  • Plays nicely with the lakehouse: downstream tables can read it incrementally

Two common ways to create one

1) With Python (Jobs/Notebooks)

from pyspark.sql.functions import current_timestamp

SRC   = "/Volumes/dev/landing/events"         # or s3://, abfss://, gs://
CHKPT = "/Volumes/dev/ops/_chk/events_stream" # checkpoint path

df = (spark.readStream
        .format("cloudFiles")                  # Auto Loader
        .option("cloudFiles.format","json")
        .option("cloudFiles.inferColumnTypes","true")
        .load(SRC)
        .withColumn("ingestion_ts", current_timestamp()))

(df.writeStream
   .option("checkpointLocation", CHKPT)
   .toTable("dev.bronze.events"))              # <-- streaming table (Delta)
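
If you don't need the stream running 24/7, the same write can run as a scheduled incremental job instead. A minimal sketch, reusing the df and CHKPT defined above (requires a Databricks runtime):

```python
# Variant: triggered incremental run instead of a continuous stream.
# availableNow=True processes everything new since the last checkpoint,
# then stops -- handy for cost control when the table runs on a Jobs schedule.
(df.writeStream
   .option("checkpointLocation", CHKPT)
   .trigger(availableNow=True)
   .toTable("dev.bronze.events"))
```

Because progress lives in the checkpoint, the continuous and triggered variants are interchangeable: each run picks up exactly where the last one left off.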

2) With DLT (Delta Live Tables) — SQL

CREATE STREAMING LIVE TABLE bronze_events
AS SELECT *, current_timestamp() AS ingestion_ts
FROM cloud_files(
  '/Volumes/dev/landing/events',   -- source folder
  'json',
  map('inferColumnTypes','true','schemaLocation','/Volumes/dev/ops/_schemas/events')
);

In DLT you’ll read it downstream with STREAM(LIVE.bronze_events) for incremental processing.
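
The same pipeline can also be written in Python with the dlt module. This sketch mirrors the SQL example and shows the downstream incremental read; it only runs inside a DLT pipeline, and the silver table name is illustrative:

```python
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(name="bronze_events")
def bronze_events():
    # Auto Loader source; same landing path as the SQL example above
    return (spark.readStream
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.inferColumnTypes", "true")
              .load("/Volumes/dev/landing/events")
              .withColumn("ingestion_ts", current_timestamp()))

@dlt.table(name="silver_events")
def silver_events():
    # Incremental (streaming) read of the table defined above --
    # the Python equivalent of STREAM(LIVE.bronze_events)
    return dlt.read_stream("bronze_events")
```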


How do I use/consume it?

  • From Spark/PySpark: spark.readStream.table("dev.bronze.events") (for further streaming transforms), or spark.read.table(...) for point-in-time batch reads.
  • From SQL/BI: it’s just a Delta table—query it like normal. (Under the hood, another job keeps it fresh.)
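
Both consumption modes side by side, as a sketch (the event_type column is hypothetical, for illustration only):

```python
# Batch (point-in-time) read: a snapshot of everything ingested so far
batch_df = spark.read.table("dev.bronze.events")
batch_df.groupBy("event_type").count().show()  # event_type is illustrative

# Streaming read: picks up only new rows for further incremental transforms
stream_df = spark.readStream.table("dev.bronze.events")
```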

Key concepts (in one line each)

  • Checkpoint: folder where the stream’s progress is saved (required).
  • Exactly-once: committed data won’t be duplicated after retries.
  • Auto Loader: easiest way to stream files; handles discovery & schema drift.
  • Schema evolution: allow new columns via options (e.g., addNewColumns) or park extras in the rescued-data column (_rescued_data).
  • Dedup/windows (optional): use watermarks + dropDuplicates or apply MERGE logic downstream in Silver.
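
The watermark + dropDuplicates pattern from the last bullet looks like this. A sketch with hypothetical event_ts/event_id columns and an assumed silver table name:

```python
# Drop duplicates seen within a 10-minute window, keyed on event_id.
# Including the watermark column in the subset lets Spark purge old
# dedup state automatically instead of growing it forever.
deduped = (spark.readStream.table("dev.bronze.events")
             .withWatermark("event_ts", "10 minutes")
             .dropDuplicates(["event_id", "event_ts"]))

(deduped.writeStream
   .option("checkpointLocation", "/Volumes/dev/ops/_chk/events_dedup")
   .toTable("dev.silver.events"))
```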

When to use a streaming table vs. other options

  • Streaming table: you want continuous ingestion with reliability (files/Kafka → Bronze).
  • Materialized view: you want fast reads of a heavy aggregate refreshed on a schedule.
  • Regular table: you load in batches (ad-hoc or scheduled copy).
