📦 Databricks Unity Catalog Volumes: Full Explanation
🔹 What is a Volume?
A Volume in Databricks Unity Catalog is a secure, governed folder used to store non-tabular data like:
- CSV, JSON, and Parquet files
- Images, PDFs, logs
- Machine learning models or artifacts
It acts like a data lake directory, but with fine-grained access control, audit logging, and native Databricks support.
🔹 Why Use a Volume?
| Benefit | Description |
|---|---|
| Secure storage | Access is controlled via Unity Catalog privileges such as READ VOLUME and WRITE VOLUME |
| Governed | Activity is audited just like it is for tables |
| Reusable | Can be shared across notebooks, workflows, and ML models |
| Simplified access | No need to manually mount storage or manage external paths in code |
🔹 Volume Location
A volume is always created inside a catalog → schema → volume hierarchy and exposed at a path like:
/Volumes/<catalog>/<schema>/<volume>
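Because every volume follows this same three-level layout, path handling in notebooks is just string assembly. A minimal sketch, with a hypothetical helper (`volume_path`) and made-up catalog/schema/volume names:

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Assemble a /Volumes/<catalog>/<schema>/<volume>/... path.

    Hypothetical convenience helper; Databricks itself simply expects
    the literal path string.
    """
    return "/".join(["/Volumes", catalog, schema, volume, *parts])

print(volume_path("main", "sales", "raw_files", "2024", "orders.csv"))
# -> /Volumes/main/sales/raw_files/2024/orders.csv
```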
🔹 Volume Types (as seen in the UI)
1. Managed Volume
- Stored in Databricks-managed cloud storage.
- Databricks handles the physical storage location.
- Deleted automatically when dropped.
- Use Case: Temporary datasets, quick experiments, staging data.
2. External Volume
- Points to your own cloud storage path (ADLS, S3, GCS).
- Requires a Storage Credential (IAM role, SAS token, etc.).
- Storage persists even if volume is dropped.
- Use Case: Production data lakes, shared cloud buckets.
🔹 Creating a Volume (via UI or SQL)
Via UI:
- Choose name, volume type, optional comment.
- For external, you also need to choose storage credential and path.
Via SQL:
```sql
-- Managed volume
CREATE VOLUME catalog.schema.volume_name
COMMENT 'This is a managed volume for training data';

-- External volume: the LOCATION must sit under an external location
-- that is already bound to a storage credential
CREATE EXTERNAL VOLUME catalog.schema.external_volume
LOCATION 'abfss://container@account.dfs.core.windows.net/my-data'
COMMENT 'External volume for a shared ADLS container';
```
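Once a volume exists, it can be inspected with standard Unity Catalog SQL commands (illustrative three-level names):

```sql
-- List the volumes in a schema
SHOW VOLUMES IN catalog.schema;

-- Show metadata (type, owner, storage location) for one volume
DESCRIBE VOLUME catalog.schema.volume_name;
```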
🔹 Volume Privileges
| Privilege | Description |
|---|---|
| READ VOLUME | Read and list files inside the volume |
| WRITE VOLUME | Add, remove, or overwrite files inside the volume |
| ALL PRIVILEGES | Grants all applicable privileges on the volume |

Accessing a volume also requires USE CATALOG on its catalog and USE SCHEMA on its schema; the volume's owner has full control, and ownership can be transferred with `ALTER VOLUME ... SET OWNER`.
```sql
GRANT READ VOLUME, WRITE VOLUME
ON VOLUME catalog.schema.volume_name
TO `user@company.com`;
```
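Grants can be verified and reversed with the matching commands (same illustrative names as above):

```sql
-- See who has access to the volume
SHOW GRANTS ON VOLUME catalog.schema.volume_name;

-- Revoke follows the same pattern
REVOKE WRITE VOLUME ON VOLUME catalog.schema.volume_name
FROM `user@company.com`;
```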
🔹 Where to Use Volumes
- In notebooks:

  ```python
  df = spark.read.csv("/Volumes/catalog/schema/volume_name/data.csv")
  ```
- In ML pipelines or Delta Live Tables
- For storing checkpoints, logs, images, model outputs, etc.
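Volume paths are also addressable directly from SQL. A sketch using the `read_files` table function, assuming a CSV file exists at this illustrative path:

```sql
SELECT *
FROM read_files(
  '/Volumes/catalog/schema/volume_name/data.csv',
  format => 'csv',
  header => true
);
```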
🔹 Governance & Lineage
- Volumes are listed in `information_schema.volumes`
- Audit events are recorded in the `system.access.audit` system table
- Integrated with Unity Catalog's permission model
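For example, the metadata view can be queried like any other table (column names per the Unity Catalog information schema; the catalog filter value is illustrative):

```sql
SELECT volume_catalog, volume_schema, volume_name, volume_type, volume_owner
FROM system.information_schema.volumes
WHERE volume_catalog = 'catalog';
```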
✅ Summary
| Feature | Managed Volume | External Volume |
|---|---|---|
| Stored in | Databricks-managed storage | Your cloud storage (S3/ADLS/GCS) |
| Auto-cleanup on drop | ✅ Yes | ❌ No |
| Needs storage credential | ❌ No | ✅ Yes |
| Governed by Unity Catalog | ✅ Yes | ✅ Yes |
| Access via /Volumes/ path | ✅ Yes | ✅ Yes |