🗂️ Databricks File System (DBFS) and Mounting Azure Data Lake Containers
Databricks offers a virtual distributed file system called DBFS which simplifies data access for notebooks, jobs, and clusters. To integrate external cloud storage like Azure Data Lake (ADLS Gen2), mounts are used — providing a convenient way to navigate and work with cloud data as if it’s part of a local filesystem.
Let’s dive into what DBFS is, how mounts work, and how to securely mount ADLS containers.
📦 What is Databricks File System (DBFS)?
DBFS is a distributed file system abstraction built on top of Azure Blob Storage, giving notebooks, clusters, and jobs seamless, path-based access to data stored in the Databricks workspace.
📌 Key Characteristics:
- DBFS is mounted into the Databricks workspace root.
- It acts as a unified interface for both ephemeral (cluster-local) and persistent storage.
- Can be accessed from notebooks, clusters, and jobs.
- Includes directories like /mnt, /databricks, and /dbfs (depending on context).
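A quick way to see these directories from a notebook is to list the DBFS root — a minimal check, assuming the notebook is attached to a running cluster:

display(dbutils.fs.ls("/"))  # lists top-level DBFS directories such as /mnt, /FileStore, and /tmp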
🔍 DBFS Architecture Overview
Control Plane
- Managed by Databricks
- Includes the workspace UI, cluster manager, and job scheduler
Data Plane
- Resides in your Azure subscription
- Includes VMs, Spark compute, and actual storage access
- ADLS Gen2 and Azure Blob Storage act as the storage backend
📁 What is DBFS Root?
The DBFS Root (e.g., /dbfs/) is the default storage location for files in a Databricks workspace, backed by Azure Blob Storage.
✅ Features:
- Accessible via the Databricks Web UI (e.g., Data tab > DBFS)
- Ideal for notebooks, libraries, input/output files
- Used by default when saving data temporarily
❗ Considerations:
- Not recommended for long-term production storage
- Use mounts instead for accessing external storage like ADLS Gen2
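For quick, temporary files the DBFS Root is convenient. A minimal sketch — the /tmp/scratch_example.txt path below is just an illustrative placeholder:

# Write and read back a small scratch file in the DBFS Root.
dbutils.fs.put("/tmp/scratch_example.txt", "hello from the DBFS Root", overwrite=True)
print(dbutils.fs.head("/tmp/scratch_example.txt"))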
🔗 What Are Databricks Mounts?
Mounts are persistent links between external cloud storage (such as ADLS Gen2 or Azure Blob Storage) and a directory in DBFS, typically under /mnt.
Once mounted:
- You can use regular file system commands (%fs ls, dbutils.fs.ls(), etc.)
- You don’t need to re-authenticate or manage tokens repeatedly
- Mounts persist across sessions and clusters
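To confirm which mounts already exist in a workspace (for example, before creating or removing one), dbutils.fs.mounts() returns every mount point and its backing source:

# Print each mount point and the cloud storage location it maps to.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)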
🚀 How to Mount an ADLS Gen2 Container to Databricks
Mounting an ADLS Gen2 container requires authentication, typically with either a storage account access key or an Azure AD service principal.
🔑 Example: Mount Using Access Key
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/mydata",
  extra_configs = {
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net": "<access-key>"
  }
)
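Hardcoding the access key in a notebook is risky; in practice the key is usually read from a Databricks secret scope and passed in place of "<access-key>" above. The scope and key names below are hypothetical placeholders:

# Retrieve the storage account key from a secret scope instead of pasting it into the notebook.
# "myscope" and "storage-account-key" are hypothetical names used for illustration.
access_key = dbutils.secrets.get(scope="myscope", key="storage-account-key")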
🛡️ Example: Mount Using Service Principal
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<client-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="myscope", key="client-secret"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/mydata",
  extra_configs = configs
)
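Once the mount succeeds, the container can be read like any other DBFS path. As an illustration, assuming CSV files exist under a hypothetical sales/ folder in the container:

# Read CSV data from the mounted container with Spark; "sales/" is a hypothetical example folder.
df = spark.read.option("header", "true").csv("/mnt/mydata/sales/")
display(df)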
🧼 How to Unmount a Container
dbutils.fs.unmount("/mnt/mydata")
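If a notebook may run repeatedly, it is common to check whether the mount point exists before unmounting (or before mounting again); a small sketch:

# Unmount only if /mnt/mydata is currently mounted, to avoid an error on repeated runs.
if any(m.mountPoint == "/mnt/mydata" for m in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/mydata")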
🔍 How to List Files
display(dbutils.fs.ls("/mnt/mydata"))
Or:
%fs ls /mnt/mydata
📊 Use Cases for Mounts
| Use Case | Mount Usage |
|---|---|
| Accessing Azure Data Lake | ✅ Recommended |
| Long-term storage | ✅ Recommended |
| Temporary scratchpad | ❌ Use DBFS Root |
| Production pipelines | ✅ Recommended |
| Shared datasets between teams | ✅ Recommended |
📘 Summary
| Concept | Description |
|---|---|
| DBFS | Virtual file system in Databricks backed by Azure Blob Storage |
| DBFS Root | Default internal storage – good for scratch data, not for production data |
| Mounts | Persistent links to external storage (like ADLS) using a /mnt/... path |
| Access Method | Access keys or service principals (prefer secrets for security) |
| Best Practice | Secure credentials with a Databricks secret scope or Azure Key Vault |