🗂️ Databricks File System (DBFS) and Mounting Azure Data Lake Containers
Databricks offers a virtual distributed file system called DBFS, which simplifies data access for notebooks, jobs, and clusters. To integrate external cloud storage such as Azure Data Lake Storage (ADLS Gen2), mounts are used, providing a convenient way to navigate and work with cloud data as if it were part of a local filesystem.
Let's dive into what DBFS is, how mounts work, and how to securely mount ADLS containers.
📦 What is Databricks File System (DBFS)?
DBFS is a distributed file system built on top of Azure Blob Storage, enabling seamless access to data stored in the Databricks workspace.
🔑 Key Characteristics:
- DBFS is mounted into the Databricks workspace root.
- It acts as a unified interface for both ephemeral (cluster-local) and persistent storage.
- Can be accessed from notebooks, clusters, and jobs.
- Includes directories like `/mnt`, `/databricks`, and `/dbfs` (depending on context).
🏗️ DBFS Architecture Overview
Control Plane
- Managed by Databricks
- Includes workspace UX, cluster manager, job scheduler
Data Plane
- Resides in your Azure subscription
- Includes VMs, Spark compute, and actual storage access
- ADLS and Azure Blob act as the backend
📁 What is DBFS Root?
The DBFS Root (e.g., /dbfs/) is the default storage for files in a Databricks workspace, backed by Azure Blob Storage.
✅ Features:
- Accessible via the Databricks Web UI (e.g., Data tab > DBFS)
- Ideal for notebooks, libraries, input/output files
- Used by default when saving data temporarily
⚠️ Considerations:
- Not recommended for long-term production storage
- Use mounts instead for accessing external storage like ADLS Gen2
🔗 What Are Databricks Mounts?
Mounts are persistent mount points that link external cloud storage (like Azure Data Lake or Azure Blob) to a specific directory in DBFS.
Once mounted:
- You can use regular file system commands (`%fs ls`, `dbutils.fs.ls()`, etc.)
- You don't need to re-authenticate or manage tokens repeatedly
- Mounts persist across sessions and clusters
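Because mounts persist across sessions, calling `dbutils.fs.mount` again for an existing mount point raises an error, so it is common to check the current mounts first. Below is a minimal sketch of that check as a pure helper; in a real notebook the mounts list would come from `dbutils.fs.mounts()`, whose entries expose a `mountPoint` attribute. The `MountInfo` stand-in and the sample entries are illustrative only:

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.mounts() on a cluster;
# the real entries also have a mountPoint attribute, which is all we need here.
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])

def is_mounted(mounts, mount_point):
    """Return True if mount_point already appears in the mounts list."""
    return any(m.mountPoint == mount_point for m in mounts)

# Illustrative data standing in for dbutils.fs.mounts()
existing = [MountInfo("/mnt/mydata", "abfss://data@acct.dfs.core.windows.net/")]

print(is_mounted(existing, "/mnt/mydata"))  # True
print(is_mounted(existing, "/mnt/other"))   # False
```

In a notebook, you would guard the mount call with `if not is_mounted(dbutils.fs.mounts(), "/mnt/mydata"): ...` so re-running the cell stays idempotent.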
🔧 How to Mount an ADLS Gen2 Container to Databricks
Mounting ADLS requires proper authentication using either access keys or service principals.
🔑 Example: Mount Using Access Key
```python
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs={
        "fs.azure.account.key.<storage-account>.dfs.core.windows.net": "<access-key>"
    }
)
```
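Since the config key embeds the storage account name, a small helper can keep the account name and key together instead of hand-editing the string. A sketch (the function name is my own; the config key format comes from the example above):

```python
def access_key_config(storage_account: str, access_key: str) -> dict:
    """Build the extra_configs dict for access-key auth against ADLS Gen2."""
    return {
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": access_key
    }

cfg = access_key_config("mystorageacct", "<access-key>")
print(cfg)
```

In practice, fetch the key from a secret scope (e.g. `dbutils.secrets.get(...)`) rather than pasting it into the notebook.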
🛡️ Example: Mount Using Service Principal
```python
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="myscope", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs=configs
)
```
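If you mount several containers with the same service principal, the OAuth configs and the `abfss://` source URI can be built from small helpers so only the container and account names change per mount. A sketch under that assumption (function names are my own; the config keys and URI format come from the example above):

```python
def service_principal_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the OAuth extra_configs dict for service-principal auth."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

def abfss_source(container: str, storage_account: str) -> str:
    """Build the abfss:// URI for an ADLS Gen2 container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"

print(abfss_source("raw", "mystorageacct"))
# abfss://raw@mystorageacct.dfs.core.windows.net/
```

In a notebook the secret would still come from `dbutils.secrets.get(...)`, never a literal string.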
🧼 How to Unmount a Container
```python
dbutils.fs.unmount("/mnt/mydata")
```
📂 How to List Files
```python
display(dbutils.fs.ls("/mnt/mydata"))
```
Or:
```
%fs ls /mnt/mydata
```
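On a Databricks cluster, DBFS is also exposed through a local FUSE mount at `/dbfs`, so files under a mount can be read with ordinary Python file APIs by prefixing the path. A minimal sketch of that path mapping (the helper name and the `sales.csv` file are illustrative):

```python
def dbfs_to_local(dbfs_path: str) -> str:
    """Map a DBFS path (e.g. /mnt/mydata/sales.csv) to the /dbfs FUSE path
    usable with ordinary Python file APIs on a Databricks cluster."""
    if not dbfs_path.startswith("/"):
        raise ValueError("expected an absolute DBFS path")
    return "/dbfs" + dbfs_path

print(dbfs_to_local("/mnt/mydata/sales.csv"))
# /dbfs/mnt/mydata/sales.csv
```

For example, `open(dbfs_to_local("/mnt/mydata/sales.csv"))` would work in a notebook, while Spark readers take the `/mnt/...` path directly.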
📌 Use Cases for Mounts
| Use Case | Mount Usage |
|---|---|
| Accessing Azure Data Lake | ✅ Recommended |
| Long-term storage | ✅ Recommended |
| Temporary scratchpad | ❌ Use DBFS Root instead |
| Production pipelines | ✅ Recommended |
| Shared datasets between teams | ✅ Recommended |
📋 Summary
| Concept | Description |
|---|---|
| DBFS | Virtual file system in Databricks backed by Blob Storage |
| DBFS Root | Default internal storage; good for scratchpads, not for production data |
| Mounts | Persistent links to external storage (like ADLS) accessed via a `/mnt/...` path |
| Access Method | Use Access Keys or Service Principals (prefer secrets for security) |
| Best Practice | Secure credentials with Databricks Secret Scope or Azure Key Vault |