Mohammad Gufran Jahangir · August 3, 2025

📦 Databricks Unity Catalog: Volume – Full Explanation

🔹 What is a Volume?

A Volume in Databricks Unity Catalog is a secure, governed folder used to store non-tabular data like:

  • CSV, JSON, and Parquet files
  • Images, PDFs, logs
  • Machine learning models or artifacts

It acts like a data lake directory, but with fine-grained access control, audit logging, and native Databricks support.


🔹 Why Use a Volume?

| Benefit | Description |
| --- | --- |
| 🔐 Secure Storage | Access is controlled via Unity Catalog privileges such as `READ VOLUME` and `WRITE VOLUME` |
| 🔍 Governed | Activity is audited just like with tables |
| 🔄 Reusable | Can be shared across notebooks, workflows, and ML models |
| 🔧 Simplifies Access | No need to manually mount storage or manage external paths in code |

🔹 Volume Location

A volume is always created inside a catalog → schema → volume path like:

/Volumes/<catalog>/<schema>/<volume>
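Since this path is just a string built from the three-level namespace, it can be assembled programmatically. A minimal sketch (the `volume_path` helper below is hypothetical, not a Databricks API):

```python
# Hypothetical helper: build the /Volumes/ path for a Unity Catalog volume.
# This is plain string handling, not a Databricks API call.
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Return the path /Volumes/<catalog>/<schema>/<volume>[/<parts>...]."""
    segments = [catalog, schema, volume, *parts]
    for s in segments:
        if not s or "/" in s:
            raise ValueError(f"invalid path segment: {s!r}")
    return "/Volumes/" + "/".join(segments)

print(volume_path("main", "raw", "landing", "2025", "data.csv"))
# /Volumes/main/raw/landing/2025/data.csv
```

Centralizing path construction like this avoids typos in hand-built `/Volumes/` strings scattered across notebooks.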

🔹 Volume Types (as seen in UI)

1. 🔐 Managed Volume

  • Stored in Databricks-managed cloud storage.
  • Databricks handles the physical storage location.
  • Files are deleted automatically when the volume is dropped.
  • Use Case: Temporary datasets, quick experiments, staging data.

2. 🌐 External Volume

  • Points to your own cloud storage path (ADLS, S3, GCS).
  • Requires an external location backed by a Storage Credential (IAM role, SAS token, etc.).
  • Storage persists even if volume is dropped.
  • Use Case: Production data lakes, shared cloud buckets.

🔹 Creating a Volume (via UI or SQL)

🔧 Via UI:

  • Choose name, volume type, optional comment.
  • For external volumes, you also choose the external location path (backed by a storage credential) where the files will live.

🧾 Via SQL:

-- Managed Volume
CREATE VOLUME catalog.schema.volume_name
COMMENT 'This is a managed volume for training data';

-- External Volume
-- The LOCATION must fall under an external location already defined in
-- Unity Catalog; the external location carries the storage credential.
CREATE EXTERNAL VOLUME catalog.schema.external_volume
LOCATION 'abfss://container@account.dfs.core.windows.net/my-data'
COMMENT 'External volume for a shared ADLS container';

🔹 Volume Privileges

| Privilege | Description |
| --- | --- |
| `READ VOLUME` | Read and list files inside the volume |
| `WRITE VOLUME` | Add, delete, or update files |
| `ALL PRIVILEGES` | All applicable privileges on the volume |
| Ownership | Full control, including managing permissions (transferred with `ALTER VOLUME ... OWNER TO`) |

To access a volume, a principal also needs `USE CATALOG` on the parent catalog and `USE SCHEMA` on the parent schema.

GRANT READ VOLUME, WRITE VOLUME
ON VOLUME catalog.schema.volume_name
TO `user@company.com`;
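When the same grants must be applied across environments (dev/stage/prod), it can help to template the statement rather than hand-write it. A small sketch (the `grant_on_volume` helper is hypothetical; on Databricks you would execute the result with `spark.sql(...)`):

```python
# Hypothetical helper: build a GRANT statement for a Unity Catalog volume.
# This only formats the SQL string; it does not talk to Databricks itself.
def grant_on_volume(privileges: list[str], volume_fqn: str, principal: str) -> str:
    """Return a GRANT statement for the given privileges, volume, and principal."""
    privs = ", ".join(privileges)
    return f"GRANT {privs} ON VOLUME {volume_fqn} TO `{principal}`;"

sql = grant_on_volume(["READ VOLUME", "WRITE VOLUME"],
                      "catalog.schema.volume_name", "user@company.com")
print(sql)
# GRANT READ VOLUME, WRITE VOLUME ON VOLUME catalog.schema.volume_name TO `user@company.com`;
```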

🔹 Where to Use Volumes

  • In notebooks:
df = spark.read.csv('/Volumes/catalog/schema/volume_name/data.csv')
  • In ML pipelines or Delta Live Tables
  • For storing checkpoints, logs, images, model outputs, etc.
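Because a volume appears as an ordinary directory under `/Volumes/`, standard Python file I/O works against it, not just Spark. A sketch, parameterized by a base directory (on Databricks this would be something like `/Volumes/catalog/schema/volume_name`; any writable directory works the same way):

```python
import os

# Sketch: volumes are exposed as regular filesystem paths, so plain
# open()/write() work. base_dir stands in for a /Volumes/... path.
def save_log(base_dir: str, name: str, text: str) -> str:
    """Write text to <base_dir>/<name> and return the full path."""
    os.makedirs(base_dir, exist_ok=True)
    path = os.path.join(base_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path

def read_log(base_dir: str, name: str) -> str:
    """Read the file back as a string."""
    with open(os.path.join(base_dir, name), encoding="utf-8") as f:
        return f.read()
```

This is what makes volumes convenient for checkpoints and logs: no mounts or cloud SDK calls, just file paths governed by Unity Catalog.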

🔹 Governance & Lineage

  • Volumes are listed in information_schema.volumes
  • Support audit logging via system.access.audit
  • Integrated with Unity Catalog’s permission model

✅ Summary

| Feature | Managed Volume | External Volume |
| --- | --- | --- |
| Stored in | Databricks-managed storage | Your cloud (S3/ADLS/GCS) |
| Auto-cleanup on drop | ✅ Yes | ❌ No |
| Needs Storage Credential | ❌ No | ✅ Yes |
| Governance | ✅ Yes | ✅ Yes |
| Access via `/Volumes/` | ✅ Yes | ✅ Yes |
