📦 Databricks Unity Catalog: Volume – Full Explanation
🔹 What is a Volume?
A Volume in Databricks Unity Catalog is a secure, governed folder used to store non-tabular data like:
- CSV, JSON, and Parquet files
- Images, PDFs, logs
- Machine learning models or artifacts
It acts like a data lake directory, but with fine-grained access control, audit logging, and native Databricks support.
🔹 Why Use a Volume?
| Benefit | Description |
|---|---|
| ✅ Secure Storage | Access is controlled via Unity Catalog privileges like READ VOLUME and WRITE VOLUME |
| 🔍 Governed | Activity is audited just like with tables |
| 🔄 Reusable | Can be shared across notebooks, workflows, and ML models |
| 🔧 Simplifies Access | No need to manually mount or manage external paths in code |
🔹 Volume Location
A volume is always created inside a catalog → schema → volume path like:
/Volumes/<catalog>/<schema>/<volume>
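The path is composed mechanically from the three names. The helper below is an illustrative sketch, not a Databricks API, showing how the pieces combine:

```python
# Illustrative helper (not a Databricks API): compose the /Volumes path
# for a file stored inside a Unity Catalog volume.
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Return the POSIX-style path Databricks exposes for a volume."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])

# Example (catalog/schema/volume names here are hypothetical):
# volume_path("main", "sales", "raw_files", "2024", "orders.csv")
#   → "/Volumes/main/sales/raw_files/2024/orders.csv"
```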
🔹 Volume Types (as seen in UI)
1. 🔐 Managed Volume
- Stored in Databricks-managed cloud storage.
- Databricks handles the physical storage location.
- Deleted automatically when dropped.
- Use Case: Temporary datasets, quick experiments, staging data.
2. 🌐 External Volume
- Points to your own cloud storage path (ADLS, S3, GCS).
- Requires a Storage Credential (IAM role, SAS token, etc.).
- Storage persists even if volume is dropped.
- Use Case: Production data lakes, shared cloud buckets.
🔹 Creating a Volume (via UI or SQL)
🔧 Via UI:
- Choose name, volume type, optional comment.
- For an external volume, you also choose the cloud storage path, which must fall inside an external location that is already configured with a storage credential.
🧾 Via SQL:
-- Managed Volume
CREATE VOLUME catalog.schema.volume_name
COMMENT 'This is a managed volume for training data';
-- External Volume (the LOCATION must fall under an external location
-- that is already configured with a storage credential)
CREATE EXTERNAL VOLUME catalog.schema.external_volume
LOCATION 'abfss://container@account.dfs.core.windows.net/my-data'
COMMENT 'External volume for a shared ADLS container';
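In a notebook, these statements are often issued programmatically via spark.sql(...). The string builder below is a hypothetical convenience, not part of any Databricks library; only the DDL it emits follows the syntax shown above:

```python
from typing import Optional

def create_volume_sql(name: str,
                      location: Optional[str] = None,
                      comment: Optional[str] = None) -> str:
    """Build a CREATE VOLUME / CREATE EXTERNAL VOLUME statement.

    Passing a location makes the volume external; omitting it
    produces a managed volume.
    """
    parts = [f"CREATE {'EXTERNAL ' if location else ''}VOLUME {name}"]
    if location:
        parts.append(f"LOCATION '{location}'")
    if comment:
        parts.append(f"COMMENT '{comment}'")
    return "\n".join(parts) + ";"

# In a notebook you would run: spark.sql(create_volume_sql("main.sales.raw_files"))
```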
🔹 Volume Privileges
| Privilege | Description |
|---|---|
| READ VOLUME | Read files and list directories inside the volume |
| WRITE VOLUME | Add, delete, or update files |
| ALL PRIVILEGES | Grants all of the above at once |
Note: reaching a volume also requires USE CATALOG and USE SCHEMA on its parent catalog and schema; the volume owner has full control, including managing permissions.
GRANT READ VOLUME, WRITE VOLUME
ON VOLUME catalog.schema.volume_name
TO `user@company.com`;
🔹 Where to Use Volumes
- In notebooks:
df = spark.read.csv('/Volumes/catalog/schema/volume_name/data.csv')
- In ML pipelines or Delta Live Tables
- For storing checkpoints, logs, images, model outputs, etc.
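Because Databricks exposes volumes as ordinary file paths, plain Python file I/O works against them too. The sketch below parameterizes the base path so the same code runs anywhere; on Databricks you would pass a path of the form /Volumes/<catalog>/<schema>/<volume>. The function names are illustrative, not a Databricks API:

```python
from pathlib import Path

def write_artifact(base: str, relpath: str, text: str) -> Path:
    """Write a small text artifact (log, report) under the base path,
    creating intermediate directories as needed."""
    target = Path(base) / relpath
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target

def read_artifact(base: str, relpath: str) -> str:
    """Read the artifact back as text."""
    return (Path(base) / relpath).read_text()

# On Databricks, base would be e.g. '/Volumes/catalog/schema/volume_name'.
```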
🔹 Governance & Lineage
- Volumes are listed in information_schema.volumes
- Support audit logging via the system.access.audit system table
- Integrated with Unity Catalog’s permission model
✅ Summary
| Feature | Managed Volume | External Volume |
|---|---|---|
| Stored in | Databricks-managed storage | Your cloud (S3/ADLS/GCS) |
| Auto-cleanup | ✅ Yes | ❌ No |
| Needs Storage Credential | ❌ No | ✅ Yes |
| Governance | ✅ Yes | ✅ Yes |
| Access via /Volumes/ | ✅ Yes | ✅ Yes |