🗂️ Databricks File System (DBFS) and Mounting Azure Data Lake Containers
Databricks offers a virtual distributed file system called DBFS, which simplifies data access for notebooks, jobs, and clusters. To integrate external cloud storage such as Azure Data Lake Storage (ADLS Gen2), mounts are used, providing a convenient way to navigate and work with cloud data as if it were part of a local filesystem.
Let's dive into what DBFS is, how mounts work, and how to securely mount ADLS containers.
📦 What is Databricks File System (DBFS)?
DBFS is a distributed file system built on top of Azure Blob Storage, enabling seamless access to data stored in the Databricks workspace.
🔑 Key Characteristics:
- DBFS is mounted into the Databricks workspace root.
- It acts as a unified interface for both ephemeral (cluster-local) and persistent storage.
- Can be accessed from notebooks, clusters, and jobs.
- Includes directories like `/mnt`, `/databricks`, and `/dbfs` (depending on context).
🏗️ DBFS Architecture Overview
Control Plane
- Managed by Databricks
- Includes workspace UX, cluster manager, job scheduler
Data Plane
- Resides in your Azure subscription
- Includes VMs, Spark compute, and actual storage access
- ADLS and Azure Blob act as the backend
📁 What is DBFS Root?
The DBFS Root (e.g., /dbfs/) is the default storage for files in a Databricks workspace, backed by Azure Blob Storage.
✅ Features:
- Accessible via the Databricks Web UI (e.g., Data tab > DBFS)
- Ideal for notebooks, libraries, input/output files
- Used by default when saving data temporarily
⚠️ Considerations:
- Not recommended for long-term production storage
- Use mounts instead for accessing external storage like ADLS Gen2
🔗 What Are Databricks Mounts?
Mounts are persistent mount points that link external cloud storage (like Azure Data Lake or Azure Blob) to a specific directory in DBFS.
Once mounted:
- You can use regular file system commands (`%fs ls`, `dbutils.fs.ls()`, etc.)
- You don't need to re-authenticate or manage tokens repeatedly
- Mounts persist across sessions and clusters
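Because mounts persist across sessions, calling `dbutils.fs.mount` again for an existing mount point raises an error, so it is common to check the current mounts first. Below is a minimal sketch of that check as a pure helper; in a real notebook the mounts list would come from `dbutils.fs.mounts()`, whose entries expose a `mountPoint` attribute. The `MountInfo` stand-in and the sample entries are illustrative only:

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.mounts() on a cluster;
# the real entries also have a mountPoint attribute, which is all we need here.
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])

def is_mounted(mounts, mount_point):
    """Return True if mount_point already appears in the mounts list."""
    return any(m.mountPoint == mount_point for m in mounts)

# Illustrative data standing in for dbutils.fs.mounts()
existing = [MountInfo("/mnt/mydata", "abfss://data@acct.dfs.core.windows.net/")]

print(is_mounted(existing, "/mnt/mydata"))  # True
print(is_mounted(existing, "/mnt/other"))   # False
```

In a notebook, you would guard the mount call with `if not is_mounted(dbutils.fs.mounts(), "/mnt/mydata"): ...` so re-running the cell stays idempotent.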
🔧 How to Mount an ADLS Gen2 Container to Databricks
Mounting ADLS requires proper authentication using either access keys or service principals.
🔑 Example: Mount Using Access Key
```python
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs={
        "fs.azure.account.key.<storage-account>.dfs.core.windows.net": "<access-key>"
    }
)
```
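Since the config key embeds the storage account name, a small helper can keep the account name and key together instead of hand-editing the string. A sketch (the function name is my own; the config key format comes from the example above):

```python
def access_key_config(storage_account: str, access_key: str) -> dict:
    """Build the extra_configs dict for access-key auth against ADLS Gen2."""
    return {
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": access_key
    }

cfg = access_key_config("mystorageacct", "<access-key>")
print(cfg)
```

In practice, fetch the key from a secret scope (e.g. `dbutils.secrets.get(...)`) rather than pasting it into the notebook.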
🛡️ Example: Mount Using Service Principal
```python
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="myscope", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/mydata",
    extra_configs=configs
)
```
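If you mount several containers with the same service principal, the OAuth configs and the `abfss://` source URI can be built from small helpers so only the container and account names change per mount. A sketch under that assumption (function names are my own; the config keys and URI format come from the example above):

```python
def service_principal_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Build the OAuth extra_configs dict for service-principal auth."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

def abfss_source(container: str, storage_account: str) -> str:
    """Build the abfss:// URI for an ADLS Gen2 container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"

print(abfss_source("raw", "mystorageacct"))
# abfss://raw@mystorageacct.dfs.core.windows.net/
```

In a notebook the secret would still come from `dbutils.secrets.get(...)`, never a literal string.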
🧼 How to Unmount a Container
```python
dbutils.fs.unmount("/mnt/mydata")
```
📂 How to List Files
```python
display(dbutils.fs.ls("/mnt/mydata"))
```
Or:
```
%fs ls /mnt/mydata
```
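On a Databricks cluster, DBFS is also exposed through a local FUSE mount at `/dbfs`, so files under a mount can be read with ordinary Python file APIs by prefixing the path. A minimal sketch of that path mapping (the helper name and the `sales.csv` file are illustrative):

```python
def dbfs_to_local(dbfs_path: str) -> str:
    """Map a DBFS path (e.g. /mnt/mydata/sales.csv) to the /dbfs FUSE path
    usable with ordinary Python file APIs on a Databricks cluster."""
    if not dbfs_path.startswith("/"):
        raise ValueError("expected an absolute DBFS path")
    return "/dbfs" + dbfs_path

print(dbfs_to_local("/mnt/mydata/sales.csv"))
# /dbfs/mnt/mydata/sales.csv
```

For example, `open(dbfs_to_local("/mnt/mydata/sales.csv"))` would work in a notebook, while Spark readers take the `/mnt/...` path directly.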
📌 Use Cases for Mounts
| Use Case | Mount Usage |
|---|---|
| Accessing Azure Data Lake | ✅ Recommended |
| Long-term storage | ✅ Recommended |
| Temporary scratchpad | ❌ Use DBFS Root instead |
| Production pipelines | ✅ Recommended |
| Shared datasets between teams | ✅ Recommended |
📋 Summary
| Concept | Description |
|---|---|
| DBFS | Virtual file system in Databricks backed by Blob Storage |
| DBFS Root | Default internal storage; good for scratchpads, not for production data |
| Mounts | Persistent links to external storage (like ADLS) accessed via a `/mnt/...` path |
| Access Method | Use Access Keys or Service Principals (prefer secrets for security) |
| Best Practice | Secure credentials with Databricks Secret Scope or Azure Key Vault |