Mohammad Gufran Jahangir · August 13, 2025

Databricks COPY INTO is a lightweight, SQL‑first way to load files from cloud storage (ADLS/S3/GCS) into Delta tables. It’s retryable and idempotent—you can schedule it safely without creating duplicates because files that were already ingested are skipped on subsequent runs. This makes it perfect for incremental batch ingestion into the Bronze layer and a strong base for exactly‑once processing end‑to‑end.


TL;DR

  • What it does: Loads files into Delta exactly once per file path, even across retries.
  • Why it’s safe: It remembers which files were already loaded and skips them; you can preview with VALIDATE.
  • When to use: Simple/cheap batch ingestion from folders with thousands of files. Use Auto Loader for streaming/advanced patterns.
  • How to get “exactly‑once” end‑to‑end: Combine COPY INTO (file‑level idempotency) + Silver MERGE with a deterministic key (row‑level idempotency).

When to use COPY INTO vs. Auto Loader

  • COPY INTO
    • Pros: Minimal setup, SQL‑only, idempotent, great for scheduled incremental loads.
    • Cons: No schema inference on target unless you enable it; fewer advanced features than Auto Loader.
  • Auto Loader (cloudFiles)
    • Pros: Scales for very large directories, continuous ingestion, schema evolution, rescued data column.
    • Cons: Slightly more setup; streaming semantics.

Rule of thumb: If you batch new files hourly/daily and the directory size is manageable, start with COPY INTO. If files arrive continuously or the folder is huge/nested, use Auto Loader.


COPY INTO syntax—anatomy and options

COPY INTO <catalog>.<schema>.<delta_table>
  [ BY POSITION | ( col1, col2, ... ) ]
  FROM { '<path>' | ( SELECT <exprs> FROM '<path>' ) }
  FILEFORMAT = CSV | JSON | AVRO | ORC | PARQUET | TEXT | BINARYFILE
  [ VALIDATE [ ALL | <num> ROWS ] ]
  [ FILES = ('f1', 'f2', ...) | PATTERN = '<glob>' ]
  [ FORMAT_OPTIONS ( key = value [, ...] ) ]
  [ COPY_OPTIONS   ( key = value [, ...] ) ];

Frequently used options

  • PATTERN: Glob matching (e.g., '*.csv', 'date=2025-08-*/part-*').
  • BY POSITION (CSV only): Map source columns by ordinal (handy for headerless CSVs).
  • VALIDATE: Dry‑run; parse and return a small preview (up to 50 rows) without writing.
  • FORMAT_OPTIONS (reader options): header, inferSchema, sep, multiLine, etc.
  • COPY_OPTIONS (command behavior):
    • force = true → disables idempotency and re‑ingests files that were already loaded.
    • mergeSchema = true → evolve target table to include new columns.

Tip: mergeSchema appears in two places. In FORMAT_OPTIONS it merges the input files’ schemas (reader‑side). In COPY_OPTIONS it evolves the target Delta table.
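
To make the distinction concrete, here is a hedged sketch (table and path names are illustrative):

-- Reader‑side only: reconcile differing schemas across the input files
COPY INTO dev.bronze.events_wide
FROM '/Volumes/dev/bronze/landing/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS('mergeSchema' = 'true');

-- Reader‑side and target‑side: also evolve the Delta table with new columns
COPY INTO dev.bronze.events_wide
FROM '/Volumes/dev/bronze/landing/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS('mergeSchema' = 'true')
COPY_OPTIONS('mergeSchema' = 'true');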


Minimal examples (Bronze)

1) CSV folder (headered) into an existing Delta table

COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS('header' = 'true')
-- Evolve target if new columns appear
COPY_OPTIONS('mergeSchema' = 'true');

2) Headerless CSV mapped by position

COPY INTO dev.bronze.events
  BY POSITION
FROM '/Volumes/dev/bronze/landing/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'false', 'sep' = '|');

3) Selecting/transforming on the fly (JSON)

COPY INTO dev.bronze.measurements
FROM (
  SELECT to_timestamp(dt)        AS event_ts,
         sensorId                AS sensor_id,
         cast(value as double)   AS reading,
         _metadata.file_name     AS src_file,
         _metadata.file_modification_time AS src_modified
  FROM '/Volumes/iot/landing/json/'
)
FILEFORMAT = JSON
FORMAT_OPTIONS('multiLine' = 'true');

The _metadata struct (file path, name, size, modification time, etc.) can be selected into your table during COPY. This is invaluable for lineage, audits, and downstream dedup.


Idempotency & COPY INTO metadata

Idempotent by design. COPY INTO remembers which files it has already ingested from a given path, and automatically skips them on subsequent runs—even if a file was later modified. This makes it safe to run on a schedule and resilient to retries.

Where do you “see” what happened?

  • Table history: DESCRIBE HISTORY <table> exposes an operationMetrics map with keys like numCopiedFiles, numAddedFiles, numSkippedFiles, numOutputRows, and ingestion timing. Use this for dashboards and alerts.
  • Row‑level lineage: Persist _metadata.file_name, _metadata.file_modification_time, and an ingestion_ts column in Bronze. You can then attribute every Silver/Gold record to a source file and load.

Example: pull metrics from history

-- Each COPY INTO commit appears as a history row with operation = 'COPY INTO'.
-- operationMetrics is a MAP<STRING, STRING>; read its keys with bracket
-- syntax, e.g. operationMetrics['numCopiedFiles'],
-- operationMetrics['numSkippedFiles'], operationMetrics['numOutputRows'].
DESCRIBE HISTORY dev.bronze.invoice_cp;

Gotcha: Setting COPY_OPTIONS('force' = 'true') tells COPY INTO to re‑ingest files it has seen before. Use this only when you intentionally want to reprocess.


Designing an idempotent ingestion pipeline

1) Storage layout

  • Use Unity Catalog Volumes or External Locations and keep a clean landing path: /Volumes/<cat>/<schema>/<vol>/landing/<dataset>/.
  • Separate landing (raw drops) from processed/archive (optional) by moving/archiving after success.

2) File naming

  • Prefer immutable, unique names (e.g., include source create time + UUID). Avoid overwriting existing names in place.

3) SQL pattern

  • Use PATTERN/FILES to bound each run, and VALIDATE in test cells.
  • Default to not using force.
  • Use COPY_OPTIONS('mergeSchema'='true') only when real evolution is expected.
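
Putting those defaults together, a bounded run might look like this (path and pattern are illustrative):

-- Preview first; drop VALIDATE to perform the actual load
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
VALIDATE 10 ROWS
PATTERN = 'date=2025-08-*/part-*.csv'
FORMAT_OPTIONS('header' = 'true');
-- No force option → files already ingested are skipped on re‑runs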

4) Observability

  • Capture DESCRIBE HISTORY metrics to a monitoring table.
  • Store _metadata fields in Bronze for per‑row/source attribution.

5) Scheduling

  • Wrap in a Databricks Job (Task) with retries; COPY INTO is retryable and safe to re‑invoke.

“Exactly‑once” processing: what it means here

  • File‑level exactly‑once: COPY INTO guarantees each file path is loaded once by default. Multiple runs won’t duplicate rows.
  • Row‑level exactly‑once: Achieve this in Silver by MERGE‑ing from Bronze with a deterministic key. If upstream re‑sends a file, your MERGE prevents duplicate business rows.

Bronze → Silver MERGE pattern

MERGE INTO dev.silver.invoice s
USING (
  SELECT invoice_id,
         invoice_date,
         amount,
         src_file,          -- from _metadata
         ingestion_ts
  FROM dev.bronze.invoice_cp
) b
ON s.invoice_id = b.invoice_id
WHEN MATCHED AND s.hash <> sha2(concat_ws('|', b.invoice_id, b.amount, b.invoice_date), 256)
  THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Notes

  • The hash (or full record) comparison gives update‑idempotency.
  • For append‑only feeds, you can simplify to WHEN NOT MATCHED THEN INSERT.
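
For an append‑only feed, the simplified pattern looks like this (same illustrative tables as above; it assumes the Bronze projection matches the Silver columns):

MERGE INTO dev.silver.invoice s
USING (
  SELECT DISTINCT invoice_id, invoice_date, amount
  FROM dev.bronze.invoice_cp
) b
ON s.invoice_id = b.invoice_id
WHEN NOT MATCHED THEN INSERT *;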

Validation & preview with VALIDATE

Use VALIDATE to check schema and preview rows without writing:

COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
VALIDATE 20 ROWS  -- parses & returns a small preview
FORMAT_OPTIONS('header'='true');

This is perfect for notebooks, CI, and schema checks.


Practical recipes

A) CSV with selective FILES list (hotfix backfill)

COPY INTO fin.bronze.txn
FROM 'abfss://raw@acct.dfs.core.windows.net/payments/2025/08/15/'
FILEFORMAT = CSV
FILES = ('txn_001.csv','txn_007.csv')
FORMAT_OPTIONS('header'='true');

B) JSON with on‑the‑fly projection & constants

COPY INTO iot.bronze.readings
FROM (
  SELECT cast(ts as timestamp) AS event_ts,
         deviceId              AS device_id,
         payload.value         AS reading,
         'v2'                  AS schema_version,
         current_timestamp()   AS ingestion_ts
  FROM '/Volumes/iot/landing/json/v2/'
)
FILEFORMAT = JSON;

C) Parquet with schema evolution to the target

COPY INTO crm.bronze.customers
FROM '/Volumes/crm/landing/parquet/'
FILEFORMAT = PARQUET
COPY_OPTIONS('mergeSchema'='true');

Operational monitoring (what to track)

Create a lightweight audit table fed by DESCRIBE HISTORY:

CREATE TABLE IF NOT EXISTS ops.copy_into_audit (
  table_name    STRING,
  ts            TIMESTAMP,
  user_name     STRING,
  copied_files  INT,
  added_files   INT,
  skipped_files INT,
  output_rows   BIGINT,
  table_version BIGINT
);

-- Assumes the history output has been materialized as a view first,
-- e.g. in a notebook cell:
--   spark.sql("DESCRIBE HISTORY dev.bronze.invoice_cp") \
--        .createOrReplaceTempView("invoice_cp_history")
INSERT INTO ops.copy_into_audit
SELECT 'dev.bronze.invoice_cp',
       timestamp,
       userName,
       CAST(operationMetrics['numCopiedFiles']  AS INT),
       CAST(operationMetrics['numAddedFiles']   AS INT),
       CAST(operationMetrics['numSkippedFiles'] AS INT),
       CAST(operationMetrics['numOutputRows']   AS BIGINT),
       version
FROM invoice_cp_history
WHERE operation = 'COPY INTO'
  AND timestamp >= current_timestamp() - INTERVAL 1 DAY;

Troubleshooting & gotchas

  • Files changed after load: COPY INTO still skips them by default. If you truly need a re‑load, use COPY_OPTIONS('force'='true') and handle duplicates downstream (e.g., MERGE in Silver).
  • Header/casing mismatches (CSV): Use BY POSITION for strict positional mapping or set readerCaseSensitive appropriately.
  • Corrupt files: Set FORMAT_OPTIONS('ignoreCorruptFiles'='true') to skip, then inspect in history; consider quarantining.
  • Massive directories: COPY INTO lists files each run; for very large nested paths, prefer Auto Loader.
  • Schema drift: Keep FORMAT_OPTIONS('mergeSchema'='false') by default; enable
    COPY_OPTIONS('mergeSchema'='true') only when you intend to evolve the target.
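
For the corrupt‑file case, a hedged skip‑and‑inspect sketch (path is illustrative):

COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'true', 'ignoreCorruptFiles' = 'true');
-- Then check DESCRIBE HISTORY operationMetrics to see how many files were skipped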

End‑to‑end pattern (Bronze → Silver → Gold)

  1. Bronze (COPY INTO): append files exactly once per path; carry _metadata + ingestion_ts.
  2. Silver (MERGE): deduplicate/upsert by business key; derive hash for change detection.
  3. Gold: serve clean, conformed facts/dimensions; add constraints and quality checks.

Simple, scalable, and resilient.


Appendix: Cheat‑sheet

  • Idempotency switch: force=false (default) → skip seen files; true → re‑ingest.
  • Preview: VALIDATE [ALL | N ROWS].
  • Pick files: FILES=(...) or PATTERN='*.csv'.
  • Headerless CSV: BY POSITION.
  • Evolve target: COPY_OPTIONS('mergeSchema'='true').
  • Capture lineage: select _metadata.* into Bronze.

Happy ingesting!
