Mohammad Gufran Jahangir · August 3, 2025

πŸ“¦ Databricks Unity Catalog: Volume – Full Explanation

πŸ”Ή What is a Volume?

A Volume in Databricks Unity Catalog is a secure, governed folder used to store non-tabular data like:

  • CSV, JSON, and Parquet files
  • Images, PDFs, logs
  • Machine learning models or artifacts

It acts like a data lake directory, but with fine-grained access control, audit logging, and native Databricks support.


πŸ”Ή Why Use a Volume?

| Benefit | Description |
| --- | --- |
| βœ… Secure storage | Access is controlled via Unity Catalog privileges such as READ VOLUME and WRITE VOLUME |
| πŸ” Governed | Activity is audited just like with tables |
| πŸ”„ Reusable | Can be shared across notebooks, workflows, and ML models |
| πŸ”§ Simplified access | No need to manually mount or manage external paths in code |

πŸ”Ή Volume Location

A volume is always created inside a catalog β†’ schema β†’ volume path like:

/Volumes/<catalog>/<schema>/<volume>
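As a quick sketch, this fixed three-level path can be assembled in Python with a small helper (the function name here is made up for illustration):

```python
def volume_path(catalog: str, schema: str, volume: str) -> str:
    """Build the fixed /Volumes/<catalog>/<schema>/<volume> path."""
    return f"/Volumes/{catalog}/{schema}/{volume}"

# e.g. volume_path("main", "default", "raw_files")
# → "/Volumes/main/default/raw_files"
```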

πŸ”Ή Volume Types (as seen in UI)

1. πŸ” Managed Volume

  • Stored in Databricks-managed cloud storage.
  • Databricks handles the physical storage location.
  • Deleted automatically when dropped.
  • Use Case: Temporary datasets, quick experiments, staging data.

2. 🌐 External Volume

  • Points to your own cloud storage path (ADLS, S3, GCS).
  • Requires a Storage Credential (IAM role, SAS token, etc.).
  • Storage persists even if volume is dropped.
  • Use Case: Production data lakes, shared cloud buckets.

πŸ”Ή Creating a Volume (via UI or SQL)

πŸ”§ Via UI:

  • Choose name, volume type, optional comment.
  • For an external volume, you also select a path under an existing external location (which is backed by a storage credential).

🧾 Via SQL:

-- Managed Volume
CREATE VOLUME catalog.schema.volume_name
COMMENT 'This is a managed volume for training data';

-- External Volume (the LOCATION path must fall under an external
-- location that Unity Catalog already governs via a storage credential)
CREATE EXTERNAL VOLUME catalog.schema.external_volume
LOCATION 'abfss://container@account.dfs.core.windows.net/my-data'
COMMENT 'External volume for a shared ADLS container';

πŸ”Ή Volume Privileges

| Privilege | Description |
| --- | --- |
| READ VOLUME | Read and list files inside the volume |
| WRITE VOLUME | Add, delete, or update files |
| ALL PRIVILEGES | All privileges on the volume |

The volume's owner has full control (can transfer ownership and manage permissions). Note that accessing a volume also requires USE CATALOG on its catalog and USE SCHEMA on its schema.

GRANT READ VOLUME, WRITE VOLUME
ON VOLUME catalog.schema.volume_name
TO `user@company.com`;
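In a notebook, the same grant can be issued from Python via spark.sql. Below is a minimal sketch that only composes the statement (the helper name is invented; on Databricks you would pass the returned string to spark.sql):

```python
def grant_volume_sql(privileges, volume, principal):
    """Compose a Unity Catalog GRANT statement for a volume.

    On Databricks the result would be executed with spark.sql(...);
    here we only build the SQL string.
    """
    return (
        f"GRANT {', '.join(privileges)} "
        f"ON VOLUME {volume} "
        f"TO `{principal}`"
    )
```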

πŸ”Ή Where to Use Volumes

  • In notebooks:
df = spark.read.csv('/Volumes/catalog/schema/volume_name/data.csv')
  • In ML pipelines or Delta Live Tables
  • For storing checkpoints, logs, images, model outputs, etc.
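Because a volume is exposed at a POSIX-style path, plain Python file APIs work against it. Here is a minimal sketch (the function name is invented, and volume_root stands in for a real /Volumes/<catalog>/<schema>/<volume> path):

```python
from pathlib import Path

def save_artifact(volume_root: str, name: str, data: bytes) -> int:
    """Write bytes under the volume path and return the stored size.

    On Databricks, volume_root would be a /Volumes/... path; any
    writable directory works for testing outside Databricks.
    """
    target = Path(volume_root) / name
    target.write_bytes(data)
    return target.stat().st_size
```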

πŸ”Ή Governance & Lineage

  • Volumes are listed in information_schema.volumes
  • Support audit logging via system.access.audit
  • Integrated with Unity Catalog’s permission model

βœ… Summary

| Feature | Managed Volume | External Volume |
| --- | --- | --- |
| Stored in | Databricks-managed storage | Your cloud (S3/ADLS/GCS) |
| Auto-cleanup on drop | βœ… Yes | ❌ No |
| Needs external location / storage credential | ❌ No | βœ… Yes |
| Unity Catalog governance | βœ… Yes | βœ… Yes |
| Access via /Volumes/ path | βœ… Yes | βœ… Yes |
