Databricks COPY INTO is a lightweight, SQL‑first way to load files from cloud storage (ADLS/S3/GCS) into Delta tables. It’s retryable and idempotent—you can schedule it safely without creating duplicates because files that were already ingested are skipped on subsequent runs. This makes it perfect for incremental batch ingestion into the Bronze layer and a strong base for exactly‑once processing end‑to‑end.
TL;DR
- What it does: Loads files into Delta exactly once per file path, even across retries.
- Why it’s safe: It remembers which files were already loaded and skips them; you can preview with VALIDATE.
- When to use: Simple, cheap batch ingestion from folders with up to thousands of files. Use Auto Loader for streaming or more advanced patterns.
- How to get “exactly‑once” end‑to‑end: Combine COPY INTO (file‑level idempotency) + Silver MERGE with a deterministic key (row‑level idempotency).
When to use COPY INTO vs. Auto Loader
- COPY INTO
  - Pros: Minimal setup, SQL‑only, idempotent, great for scheduled incremental loads.
  - Cons: No schema inference on the target unless you enable it; fewer advanced features than Auto Loader.
- Auto Loader (cloudFiles)
  - Pros: Scales to very large directories, continuous ingestion, schema evolution, rescued data column.
  - Cons: Slightly more setup; streaming semantics.
Rule of thumb: If you batch new files hourly/daily and the directory size is manageable, start with COPY INTO. If files arrive continuously or the folder is huge/nested, use Auto Loader.
COPY INTO syntax—anatomy and options
COPY INTO <catalog>.<schema>.<delta_table>
[ BY POSITION | ( col1, col2, ... ) ]
FROM { '<path>' | ( SELECT <exprs> FROM '<path>' ) }
FILEFORMAT = CSV | JSON | AVRO | ORC | PARQUET | TEXT | BINARYFILE
[ VALIDATE [ ALL | <num> ROWS ] ]
[ FILES = ('f1', 'f2', ...) | PATTERN = '<glob>' ]
[ FORMAT_OPTIONS ( key = value [, ...] ) ]
[ COPY_OPTIONS ( key = value [, ...] ) ];
Frequently used options
- PATTERN: Glob matching (e.g., '*.csv', 'date=2025-08-*/part-*').
- BY POSITION (CSV only): Map source columns by ordinal (handy for headerless CSVs).
- VALIDATE: Dry‑run; parse and return a small preview (up to 50 rows) without writing.
- FORMAT_OPTIONS (reader options): header, inferSchema, sep, multiLine, etc.
- COPY_OPTIONS (command behavior): force = true disables idempotency and reloads files; mergeSchema = true evolves the target table to include new columns.
Tip: mergeSchema appears in two places. In FORMAT_OPTIONS it merges the input files’ schemas (reader‑side). In COPY_OPTIONS it evolves the target Delta table.
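PATTERN globs behave like ordinary shell globs. Before running a COPY INTO against real storage, you can sanity-check a pattern locally with Python's fnmatch (a rough stand-in: the file names below are hypothetical, and Spark's path globbing also supports `{a,b}` alternation, which fnmatch does not):

```python
from fnmatch import fnmatch

# Hypothetical file names as they might appear under a landing folder.
files = [
    "date=2025-08-01/part-0001.csv",
    "date=2025-08-15/part-0042.csv",
    "date=2025-07-31/part-0007.csv",
    "readme.txt",
]

# The same glob you would pass as PATTERN = 'date=2025-08-*/part-*'
pattern = "date=2025-08-*/part-*"
matched = [f for f in files if fnmatch(f, pattern)]
print(matched)  # only the two August files match
```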
Minimal examples (Bronze)
1) CSV folder (headered) into an existing Delta table
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS('header' = 'true')
-- Evolve target if new columns appear
COPY_OPTIONS('mergeSchema' = 'true');
2) Headerless CSV mapped by position
COPY INTO dev.bronze.events
BY POSITION
FROM '/Volumes/dev/bronze/landing/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS('header' = 'false', 'sep' = '|');
3) Selecting/transforming on the fly (JSON)
COPY INTO dev.bronze.measurements
FROM (
SELECT to_timestamp(_json.dt) AS event_ts,
_json.sensorId AS sensor_id,
cast(_json.value as double) AS reading,
_metadata.file_name AS src_file,
_metadata.file_modification_time AS src_modified
FROM '/Volumes/iot/landing/json/'
)
FILEFORMAT = JSON
FORMAT_OPTIONS('multiLine' = 'true');
The _metadata struct (file path, name, size, modification time, etc.) can be selected into your table during COPY. This is invaluable for lineage, audits, and downstream dedup.
Idempotency & COPY INTO metadata
Idempotent by design. COPY INTO remembers which files it has already ingested from a given path, and automatically skips them on subsequent runs—even if a file was later modified. This makes it safe to run on a schedule and resilient to retries.
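Conceptually, the skip logic is a persisted ledger of already-ingested file paths. A minimal Python sketch of the idea (not Databricks' actual implementation, which tracks this state in the Delta table's transaction log):

```python
def copy_into(ledger: set, files: list, force: bool = False) -> list:
    """Simulate COPY INTO's file-level idempotency.

    ledger: set of file paths already ingested (persisted state).
    force:  when True, re-ingest even previously seen files.
    Returns the list of files actually loaded this run.
    """
    loaded = []
    for path in files:
        if path in ledger and not force:
            continue  # already ingested: skip, so reruns stay duplicate-free
        loaded.append(path)
        ledger.add(path)
    return loaded

ledger = set()
print(copy_into(ledger, ["a.csv", "b.csv"]))           # ['a.csv', 'b.csv']
print(copy_into(ledger, ["a.csv", "b.csv", "c.csv"]))  # ['c.csv'] — retry-safe
```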
Where do you “see” what happened?
- Table history: DESCRIBE HISTORY <table> exposes an operationMetrics map with keys like numCopiedFiles, numAddedFiles, numSkippedFiles, numOutputRows, and ingestion timing. Use this for dashboards and alerts.
- Row‑level lineage: Persist _metadata.file_name, _metadata.file_modification_time, and an ingestion_ts column in Bronze. You can then attribute every Silver/Gold record to a source file and load.
Example: pull metrics from history
SELECT timestamp, userName, operation,
       operationMetrics['numCopiedFiles']  AS copied_files,
       operationMetrics['numAddedFiles']   AS added_files,
       operationMetrics['numSkippedFiles'] AS skipped_files,
       operationMetrics['numOutputRows']   AS output_rows
FROM (DESCRIBE HISTORY dev.bronze.invoice_cp)
ORDER BY timestamp DESC
LIMIT 20;
Gotcha: Setting COPY_OPTIONS('force' = 'true') tells COPY INTO to re‑ingest files it has seen before. Use this only when you intentionally want to reprocess.
Designing an idempotent ingestion pipeline
1) Storage layout
- Use Unity Catalog Volumes or External Locations and keep a clean landing path: /Volumes/<cat>/<schema>/<vol>/landing/<dataset>/.
- Separate landing (raw drops) from processed/archive (optional) by moving/archiving files after success.
2) File naming
- Prefer immutable, unique names (e.g., include source create time + UUID). Avoid overwriting existing names in place.
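A helper that builds such names might look like the sketch below (illustrative only; the function name and the `dataset_timestamp_uuid` layout are assumptions, not a Databricks convention):

```python
import uuid
from datetime import datetime, timezone

def landing_file_name(dataset: str, ext: str = "csv") -> str:
    """Immutable, collision-free name: dataset + UTC create time + UUID."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{dataset}_{ts}_{uuid.uuid4().hex}.{ext}"

print(landing_file_name("invoice"))  # e.g. invoice_20250815T101112Z_<32 hex chars>.csv
```

Because each drop gets a fresh name, a re-sent file lands as a new path instead of overwriting a path COPY INTO has already marked as seen.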
3) SQL pattern
- Use PATTERN/FILES to bound each run, and VALIDATE in test cells.
- Default to not using force.
- Use COPY_OPTIONS('mergeSchema'='true') only when real evolution is expected.
4) Observability
- Capture DESCRIBE HISTORY metrics to a monitoring table.
- Store _metadata fields in Bronze for per‑row/source attribution.
5) Scheduling
- Wrap in a Databricks Job (Task) with retries; COPY INTO is retryable and safe to re‑invoke.
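Because reruns skip already-loaded files, a job wrapper can retry the whole statement blindly. A hedged sketch of such a retry loop; `run_sql` is a stand-in for whatever client you use (for example, a cursor from the Databricks SQL connector), not a real API:

```python
import time

def run_with_retries(run_sql, statement: str, attempts: int = 3, backoff_s: float = 2.0):
    """Retry a COPY INTO statement; safe because reruns skip already-loaded files."""
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return run_sql(statement)
        except Exception as err:  # in practice, catch your client's specific error type
            last_err = err
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
    raise last_err
```

A blind retry like this would double-load data with a plain INSERT; COPY INTO's file-level ledger is what makes it safe.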
“Exactly‑once” processing: what it means here
- File‑level exactly‑once: COPY INTO guarantees each file path is loaded once by default. Multiple runs won’t duplicate rows.
- Row‑level exactly‑once: Achieve this in Silver by MERGE‑ing from Bronze with a deterministic key. If upstream re‑sends a file, your MERGE prevents duplicate business rows.
Bronze → Silver MERGE pattern
MERGE INTO dev.silver.invoice s
USING (
  SELECT invoice_id,
         invoice_date,
         amount,
         src_file,          -- from _metadata
         ingestion_ts,
         sha2(concat_ws('|', invoice_id, amount, invoice_date), 256) AS hash
  FROM dev.bronze.invoice_cp
) b
ON s.invoice_id = b.invoice_id
WHEN MATCHED AND s.hash <> b.hash
  THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Notes
- The hash (or full record) comparison gives update‑idempotency.
- For append‑only feeds, you can simplify to WHEN NOT MATCHED THEN INSERT.
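The same idea can be checked outside SQL. A small Python sketch of row-level idempotency, mirroring `sha2(concat_ws('|', ...), 256)` and the MERGE semantics with a plain dict standing in for the Silver table (column names match the example above; the dict-based "table" is obviously a simplification):

```python
import hashlib

def row_hash(*fields) -> str:
    """Python analogue of sha2(concat_ws('|', ...), 256) over stringified fields."""
    return hashlib.sha256("|".join(str(f) for f in fields).encode()).hexdigest()

def merge(silver: dict, bronze_rows: list) -> None:
    """Upsert by business key; write only when the content hash changed."""
    for row in bronze_rows:
        h = row_hash(row["invoice_id"], row["amount"], row["invoice_date"])
        existing = silver.get(row["invoice_id"])
        if existing is None or existing["hash"] != h:
            silver[row["invoice_id"]] = {**row, "hash": h}

silver = {}
rows = [{"invoice_id": 1, "amount": 10.0, "invoice_date": "2025-08-15"}]
merge(silver, rows)
merge(silver, rows)   # re-sent file: no duplicate row, no spurious update
print(len(silver))    # 1
```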
Validation & preview with VALIDATE
Use VALIDATE to check schema and preview rows without writing:
COPY INTO dev.bronze.invoice_cp
FROM '/Volumes/dev/bronze/landing/input/'
FILEFORMAT = CSV
VALIDATE 20 ROWS -- parses & returns a small preview
FORMAT_OPTIONS('header'='true');
This is perfect for notebooks, CI, and schema checks.
Practical recipes
A) CSV with selective FILES list (hotfix backfill)
COPY INTO fin.bronze.txn
FROM 'abfss://raw@acct.dfs.core.windows.net/payments/2025/08/15/'
FILEFORMAT = CSV
FILES = ('txn_001.csv','txn_007.csv')
FORMAT_OPTIONS('header'='true');
B) JSON with on‑the‑fly projection & constants
COPY INTO iot.bronze.readings
FROM (
SELECT cast(_json.ts as timestamp) as event_ts,
_json.deviceId as device_id,
_json.payload.value as reading,
'v2' as schema_version,
current_timestamp() as ingestion_ts
FROM '/Volumes/iot/landing/json/v2/'
)
FILEFORMAT = JSON;
C) Parquet with schema evolution to the target
COPY INTO crm.bronze.customers
FROM '/Volumes/crm/landing/parquet/'
FILEFORMAT = PARQUET
COPY_OPTIONS('mergeSchema'='true');
Operational monitoring (what to track)
Create a lightweight audit table fed by DESCRIBE HISTORY:
CREATE TABLE IF NOT EXISTS ops.copy_into_audit (
table_name STRING,
ts TIMESTAMP,
user_name STRING,
copied_files INT,
added_files INT,
skipped_files INT,
output_rows BIGINT,
operation_id STRING
);
INSERT INTO ops.copy_into_audit
SELECT 'dev.bronze.invoice_cp',
       timestamp,
       userName,
       cast(operationMetrics['numCopiedFiles'] AS INT),
       cast(operationMetrics['numAddedFiles'] AS INT),
       cast(operationMetrics['numSkippedFiles'] AS INT),
       cast(operationMetrics['numOutputRows'] AS BIGINT),
       operationParameters['operationId']
FROM (DESCRIBE HISTORY dev.bronze.invoice_cp)
WHERE operation = 'COPY INTO'
  AND timestamp >= current_timestamp() - INTERVAL 1 DAY;
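Note that operationMetrics is a map of string values, so the counts arrive as strings. If you post-process history rows in Python instead of SQL, a small cast step keeps the audit data typed (a sketch; the input dict shape and key names simply mirror the metrics used above):

```python
def parse_copy_metrics(metrics: dict) -> dict:
    """Convert operationMetrics entries (string-valued map) to integers,
    defaulting missing keys to 0."""
    keys = ("numCopiedFiles", "numAddedFiles", "numSkippedFiles", "numOutputRows")
    return {k: int(metrics.get(k, "0")) for k in keys}

print(parse_copy_metrics({"numCopiedFiles": "3", "numOutputRows": "1200"}))
# {'numCopiedFiles': 3, 'numAddedFiles': 0, 'numSkippedFiles': 0, 'numOutputRows': 1200}
```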
Troubleshooting & gotchas
- Files changed after load: COPY INTO still skips them by default. If you truly need a re‑load, use COPY_OPTIONS('force'='true') and handle duplicates downstream (e.g., MERGE in Silver).
- Header/casing mismatches (CSV): Use BY POSITION for strict positional mapping or set readerCaseSensitive appropriately.
- Corrupt files: Set FORMAT_OPTIONS('ignoreCorruptFiles'='true') to skip them, then inspect in history; consider quarantining.
- Massive directories: COPY INTO lists files each run; for very large nested paths, prefer Auto Loader.
- Schema drift: Keep FORMAT_OPTIONS('mergeSchema'='false') by default; enable COPY_OPTIONS('mergeSchema'='true') only when you intend to evolve the target.
End‑to‑end pattern (Bronze → Silver → Gold)
- Bronze (COPY INTO): append files exactly once per path; carry _metadata + ingestion_ts.
- Silver (MERGE): deduplicate/upsert by business key; derive hash for change detection.
- Gold: serve clean, conformed facts/dimensions; add constraints and quality checks.
Simple, scalable, and resilient.
Appendix: Cheat‑sheet
- Idempotency switch: force=false (default) → skip seen files; true → re‑ingest.
- Preview: VALIDATE [ALL | N ROWS].
- Pick files: FILES=(...) or PATTERN='*.csv'.
- Headerless CSV: BY POSITION.
- Evolve target: COPY_OPTIONS('mergeSchema'='true').
- Capture lineage: select _metadata.* into Bronze.
Happy ingesting!