Mohammad Gufran Jahangir, August 11, 2025

When working with Databricks, you will store and access data from various locations — some are Databricks-native (like DBFS), and others are external (like Azure Data Lake Storage via ABFSS).
Choosing the right storage access path impacts performance, cost, governance, and security.


1. DBFS (Databricks File System)

What is DBFS?

  • DBFS is a distributed file system built on top of cloud object storage (Azure Data Lake Storage, AWS S3, or GCP Cloud Storage).
  • It lets you mount cloud storage into the workspace so it appears as part of a local file system, simplifying access.
  • Provides a POSIX-like interface for file management inside Databricks.

Key Characteristics

  • Path format: dbfs:/mnt/<mount-name>/<path>
  • Fully integrated with Spark — can be accessed with spark.read, dbutils.fs, or Python’s file I/O.
  • You can mount external storage like Azure Data Lake Storage (ADLS) or Blob Storage into DBFS.
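The three access routes mentioned above (spark.read, dbutils.fs, and plain Python I/O) all resolve to the same data. A minimal sketch, with the Spark and dbutils calls commented out since they only run on a Databricks cluster, and the mount name and file names purely illustrative:

```python
# Sketch: three ways to reach the same DBFS path.
# Plain Python I/O works because Databricks exposes DBFS on the driver
# via a FUSE mount under /dbfs.

def dbfs_to_fuse(path: str) -> str:
    """Translate a dbfs:/ URI into the /dbfs FUSE path usable with open()."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path

# 1. Spark reader (distributed):
# df = spark.read.csv("dbfs:/mnt/raw/data.csv", header=True)

# 2. dbutils file utilities:
# dbutils.fs.ls("dbfs:/mnt/raw")

# 3. Plain Python on the driver, via the FUSE path:
# with open(dbfs_to_fuse("dbfs:/mnt/raw/data.csv")) as f:
#     first_line = f.readline()

print(dbfs_to_fuse("dbfs:/mnt/raw/data.csv"))  # /dbfs/mnt/raw/data.csv
```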

When to Use

  • When you want a simplified path structure and unified access to external storage inside Databricks.
  • For staging data during ETL before writing to final storage.

Example

Mount Azure Data Lake Storage into DBFS:

dbutils.fs.mount(
  source="abfss://raw@mydatalake.dfs.core.windows.net/",
  mount_point="/mnt/raw",
  extra_configs={"fs.azure.account.key.mydatalake.dfs.core.windows.net": "<storage-key>"}
)

# Access files
display(dbutils.fs.ls("/mnt/raw"))
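Note that dbutils.fs.mount fails if the mount point already exists, so it is common to check existing mounts first. A minimal sketch (the dbutils calls are commented out because they only run on a cluster; the mount names are illustrative):

```python
# Sketch: mount only when the mount point is not already present.

def needs_mount(mount_point: str, existing_mount_points) -> bool:
    """Return True when mount_point is absent from the currently mounted paths."""
    return mount_point not in set(existing_mount_points)

# On a Databricks cluster:
# mounted = [m.mountPoint for m in dbutils.fs.mounts()]
# if needs_mount("/mnt/raw", mounted):
#     dbutils.fs.mount(
#         source="abfss://raw@mydatalake.dfs.core.windows.net/",
#         mount_point="/mnt/raw",
#         extra_configs={...},
#     )

print(needs_mount("/mnt/raw", ["/databricks-datasets", "/mnt/other"]))  # True
```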

2. ABFSS (Azure Blob File System Secure)

What is ABFSS?

  • abfss:// is the direct URI scheme for accessing Azure Data Lake Storage Gen2.
  • It bypasses DBFS mounting and directly uses Azure’s secure endpoint.
  • Recommended for production workloads where security and performance are priorities.

Key Characteristics

  • Path format: abfss://<container>@<storage-account>.dfs.core.windows.net/<path>
  • Works with OAuth / Managed Identity / Service Principal authentication.
  • Avoids some DBFS mounting limitations (like needing workspace admin to configure mounts).

When to Use

  • For secure, direct access to data in ADLS Gen2.
  • When implementing Unity Catalog — mounts in DBFS are discouraged for governed data.

Example

Read a Parquet file directly from ADLS Gen2:

df = spark.read.parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/2025/01/data.parquet"
)
df.show()

Using OAuth with Managed Identity:

spark.conf.set(
  "fs.azure.account.auth.type.mydatalake.dfs.core.windows.net",
  "OAuth"
)
spark.conf.set(
  "fs.azure.account.oauth.provider.type.mydatalake.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
  "fs.azure.account.oauth2.client.id.mydatalake.dfs.core.windows.net",
  "<app-client-id>"
)
spark.conf.set(
  "fs.azure.account.oauth2.client.secret.mydatalake.dfs.core.windows.net",
  "<app-secret>"
)
spark.conf.set(
  "fs.azure.account.oauth2.client.endpoint.mydatalake.dfs.core.windows.net",
  "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)
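In practice you would not hardcode the client secret as shown above; Databricks secret scopes let you pull it at runtime with dbutils.secrets.get. A sketch that also factors out the repetitive per-account config keys (the scope and key names "adls-scope" and "sp-secret" are illustrative assumptions):

```python
# Sketch: build the per-account Spark config keys once, then set them,
# fetching the client secret from a Databricks secret scope.

def abfss_conf_keys(account: str) -> dict:
    """Spark config keys for OAuth against one ADLS Gen2 storage account."""
    suffix = f"{account}.dfs.core.windows.net"
    return {
        "auth_type": f"fs.azure.account.auth.type.{suffix}",
        "provider": f"fs.azure.account.oauth.provider.type.{suffix}",
        "client_id": f"fs.azure.account.oauth2.client.id.{suffix}",
        "client_secret": f"fs.azure.account.oauth2.client.secret.{suffix}",
        "endpoint": f"fs.azure.account.oauth2.client.endpoint.{suffix}",
    }

keys = abfss_conf_keys("mydatalake")

# On a Databricks cluster:
# spark.conf.set(keys["auth_type"], "OAuth")
# spark.conf.set(
#     keys["client_secret"],
#     dbutils.secrets.get(scope="adls-scope", key="sp-secret"),
# )

print(keys["client_id"])
```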

3. Other Location Schemes in Databricks

a) file://

  • Refers to the local file system on the driver node.
  • Exists only during the job’s execution.
  • Not recommended for distributed workloads since workers can’t see the driver’s local files.

Example:

df = spark.read.csv("file:/tmp/localfile.csv")
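Because workers cannot see the driver's local disk, the usual workaround is to copy the file into shared storage first. A minimal sketch (the paths are illustrative, and the dbutils/Spark calls are commented out since they need a cluster):

```python
# Sketch: make a driver-local file visible cluster-wide by copying it into DBFS.

def as_file_uri(local_path: str) -> str:
    """Prefix a driver-local path with the file: scheme Spark expects."""
    return local_path if local_path.startswith("file:/") else "file:" + local_path

# On a Databricks cluster:
# dbutils.fs.cp(as_file_uri("/tmp/localfile.csv"), "dbfs:/tmp/localfile.csv")
# df = spark.read.csv("dbfs:/tmp/localfile.csv")  # now readable by all workers

print(as_file_uri("/tmp/localfile.csv"))  # file:/tmp/localfile.csv
```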

b) s3:// (AWS)

  • Direct access to Amazon S3 bucket.
  • Often used with IAM roles for authentication.

Example:

df = spark.read.json("s3://mybucket/data/2025/")

c) gs:// (GCP)

  • Direct access to Google Cloud Storage.
  • Requires service account authentication.

Example:

df = spark.read.parquet("gs://mybucket/data/")

d) wasbs:// (Azure Blob Storage – Legacy)

  • Used for Azure Blob Storage, not ADLS Gen2.
  • Legacy protocol — replaced by ABFSS for secure workloads.
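For comparison with the abfss format above, a wasbs path targets the .blob endpoint rather than .dfs (account and container names are illustrative):

```python
# Sketch: legacy wasbs path layout. Note blob.core.windows.net,
# not the dfs.core.windows.net endpoint used by abfss.
wasbs_path = "wasbs://raw@mystorageacct.blob.core.windows.net/2025/data.csv"

# On a Databricks cluster:
# df = spark.read.csv(wasbs_path)
```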

4. DBFS vs ABFSS – Comparison Table

| Feature     | DBFS                               | ABFSS                                |
| ----------- | ---------------------------------- | ------------------------------------ |
| Access Type | Mounted via Databricks             | Direct to ADLS                       |
| Security    | Key-based (less secure)            | OAuth / Managed Identity (secure)    |
| Performance | Slight overhead due to mount layer | Faster direct access                 |
| Governance  | Not recommended for Unity Catalog  | Fully compatible with Unity Catalog  |
| Ease of Use | Easier for dev/test                | Requires more config for auth        |

5. Best Practices

  • Use ABFSS for production and sensitive data.
  • Use DBFS mounts for dev/test environments or temporary storage.
  • Avoid file:// for distributed workloads.
  • Prefer OAuth/Managed Identity over storage keys for security compliance.
  • In Unity Catalog, do not mount data — access it directly via ABFSS or registered external locations.

Example – Reading from DBFS and ABFSS in the Same Notebook

# DBFS
df_dbfs = spark.read.csv("dbfs:/mnt/raw/2025/data.csv")

# ABFSS
df_abfss = spark.read.csv(
    "abfss://raw@mydatalake.dfs.core.windows.net/2025/data.csv"
)
