Mohammad Gufran Jahangir, August 11, 2025

When working with Databricks, you will store and access data from various locations — some are Databricks-native (like DBFS), and others are external (like Azure Data Lake Storage via ABFSS).
Choosing the right storage access path impacts performance, cost, governance, and security.


1. DBFS (Databricks File System)

What is DBFS?

  • DBFS is a distributed file system built on top of cloud object storage (Azure Data Lake Storage, AWS S3, or GCP Cloud Storage).
  • It lets you mount cloud storage into the workspace so it appears as part of a local file system, simplifying access.
  • Provides a POSIX-like interface for file management inside Databricks.

Key Characteristics

  • Path format: dbfs:/mnt/<mount-name>/<path>
  • Fully integrated with Spark — can be accessed with spark.read, dbutils.fs, or Python’s file I/O.
  • You can mount external storage like Azure Data Lake Storage (ADLS) or Blob Storage into DBFS.
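The three access routes mentioned above (spark.read, dbutils.fs, and plain Python I/O) all resolve to the same data. A minimal sketch, with the Spark and dbutils calls commented out since they only run on a Databricks cluster, and the mount name and file names purely illustrative:

```python
# Sketch: three ways to reach the same DBFS path.
# Plain Python I/O works because Databricks exposes DBFS on the driver
# via a FUSE mount under /dbfs.

def dbfs_to_fuse(path: str) -> str:
    """Translate a dbfs:/ URI into the /dbfs FUSE path usable with open()."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path

# 1. Spark reader (distributed):
# df = spark.read.csv("dbfs:/mnt/raw/data.csv", header=True)

# 2. dbutils file utilities:
# dbutils.fs.ls("dbfs:/mnt/raw")

# 3. Plain Python on the driver, via the FUSE path:
# with open(dbfs_to_fuse("dbfs:/mnt/raw/data.csv")) as f:
#     first_line = f.readline()

print(dbfs_to_fuse("dbfs:/mnt/raw/data.csv"))  # /dbfs/mnt/raw/data.csv
```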

When to Use

  • When you want a simplified path structure and unified access to external storage inside Databricks.
  • For staging data during ETL before writing to final storage.

Example

Mount Azure Data Lake Storage into DBFS:

dbutils.fs.mount(
  source="abfss://raw@mydatalake.dfs.core.windows.net/",
  mount_point="/mnt/raw",
  extra_configs={"fs.azure.account.key.mydatalake.dfs.core.windows.net": "<storage-key>"}
)

# Access files
display(dbutils.fs.ls("/mnt/raw"))
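Note that dbutils.fs.mount fails if the mount point already exists, so it is common to check existing mounts first. A minimal sketch (the dbutils calls are commented out because they only run on a cluster; the mount names are illustrative):

```python
# Sketch: mount only when the mount point is not already present.

def needs_mount(mount_point: str, existing_mount_points) -> bool:
    """Return True when mount_point is absent from the currently mounted paths."""
    return mount_point not in set(existing_mount_points)

# On a Databricks cluster:
# mounted = [m.mountPoint for m in dbutils.fs.mounts()]
# if needs_mount("/mnt/raw", mounted):
#     dbutils.fs.mount(
#         source="abfss://raw@mydatalake.dfs.core.windows.net/",
#         mount_point="/mnt/raw",
#         extra_configs={...},
#     )

print(needs_mount("/mnt/raw", ["/databricks-datasets", "/mnt/other"]))  # True
```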

2. ABFSS (Azure Blob File System Secure)

What is ABFSS?

  • abfss:// is the direct URI scheme for accessing Azure Data Lake Storage Gen2.
  • It bypasses DBFS mounting and directly uses Azure’s secure endpoint.
  • Recommended for production workloads where security and performance are priorities.

Key Characteristics

  • Path format: abfss://<container>@<storage-account>.dfs.core.windows.net/<path>
  • Works with OAuth / Managed Identity / Service Principal authentication.
  • Avoids some DBFS mounting limitations (like needing workspace admin to configure mounts).

When to Use

  • For secure, direct access to data in ADLS Gen2.
  • When implementing Unity Catalog — mounts in DBFS are discouraged for governed data.

Example

Read a Parquet file directly from ADLS Gen2:

df = spark.read.parquet(
    "abfss://raw@mydatalake.dfs.core.windows.net/2025/01/data.parquet"
)
df.show()

Using OAuth with Managed Identity:

spark.conf.set(
  "fs.azure.account.auth.type.mydatalake.dfs.core.windows.net",
  "OAuth"
)
spark.conf.set(
  "fs.azure.account.oauth.provider.type.mydatalake.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
  "fs.azure.account.oauth2.client.id.mydatalake.dfs.core.windows.net",
  "<app-client-id>"
)
spark.conf.set(
  "fs.azure.account.oauth2.client.secret.mydatalake.dfs.core.windows.net",
  "<app-secret>"
)
spark.conf.set(
  "fs.azure.account.oauth2.client.endpoint.mydatalake.dfs.core.windows.net",
  "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
)
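In practice you would not hardcode the client secret as shown above; Databricks secret scopes let you pull it at runtime with dbutils.secrets.get. A sketch that also factors out the repetitive per-account config keys (the scope and key names "adls-scope" and "sp-secret" are illustrative assumptions):

```python
# Sketch: build the per-account Spark config keys once, then set them,
# fetching the client secret from a Databricks secret scope.

def abfss_conf_keys(account: str) -> dict:
    """Spark config keys for OAuth against one ADLS Gen2 storage account."""
    suffix = f"{account}.dfs.core.windows.net"
    return {
        "auth_type": f"fs.azure.account.auth.type.{suffix}",
        "provider": f"fs.azure.account.oauth.provider.type.{suffix}",
        "client_id": f"fs.azure.account.oauth2.client.id.{suffix}",
        "client_secret": f"fs.azure.account.oauth2.client.secret.{suffix}",
        "endpoint": f"fs.azure.account.oauth2.client.endpoint.{suffix}",
    }

keys = abfss_conf_keys("mydatalake")

# On a Databricks cluster:
# spark.conf.set(keys["auth_type"], "OAuth")
# spark.conf.set(
#     keys["client_secret"],
#     dbutils.secrets.get(scope="adls-scope", key="sp-secret"),
# )

print(keys["client_id"])
```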

3. Other Location Schemes in Databricks

a) file://

  • Refers to the local file system on the driver node.
  • Exists only during the job’s execution.
  • Not recommended for distributed workloads since workers can’t see the driver’s local files.

Example:

df = spark.read.csv("file:/tmp/localfile.csv")
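Because workers cannot see the driver's local disk, the usual workaround is to copy the file into shared storage first. A minimal sketch (the paths are illustrative, and the dbutils/Spark calls are commented out since they need a cluster):

```python
# Sketch: make a driver-local file visible cluster-wide by copying it into DBFS.

def as_file_uri(local_path: str) -> str:
    """Prefix a driver-local path with the file: scheme Spark expects."""
    return local_path if local_path.startswith("file:/") else "file:" + local_path

# On a Databricks cluster:
# dbutils.fs.cp(as_file_uri("/tmp/localfile.csv"), "dbfs:/tmp/localfile.csv")
# df = spark.read.csv("dbfs:/tmp/localfile.csv")  # now readable by all workers

print(as_file_uri("/tmp/localfile.csv"))  # file:/tmp/localfile.csv
```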

b) s3:// (AWS)

  • Direct access to Amazon S3 bucket.
  • Often used with IAM roles for authentication.

Example:

df = spark.read.json("s3://mybucket/data/2025/")

c) gs:// (GCP)

  • Direct access to Google Cloud Storage.
  • Requires service account authentication.

Example:

df = spark.read.parquet("gs://mybucket/data/")

d) wasbs:// (Azure Blob Storage – Legacy)

  • Used for Azure Blob Storage, not ADLS Gen2.
  • Legacy protocol — replaced by ABFSS for secure workloads.
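For comparison with the abfss format above, a wasbs path targets the .blob endpoint rather than .dfs (account and container names are illustrative):

```python
# Sketch: legacy wasbs path layout. Note blob.core.windows.net,
# not the dfs.core.windows.net endpoint used by abfss.
wasbs_path = "wasbs://raw@mystorageacct.blob.core.windows.net/2025/data.csv"

# On a Databricks cluster:
# df = spark.read.csv(wasbs_path)
```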

4. DBFS vs ABFSS – Comparison Table

| Feature     | DBFS                               | ABFSS                                |
| ----------- | ---------------------------------- | ------------------------------------ |
| Access Type | Mounted via Databricks             | Direct to ADLS                       |
| Security    | Key-based (less secure)            | OAuth / Managed Identity (secure)    |
| Performance | Slight overhead due to mount layer | Faster direct access                 |
| Governance  | Not recommended for Unity Catalog  | Fully compatible with Unity Catalog  |
| Ease of Use | Easier for dev/test                | Requires more config for auth        |

5. Best Practices

  • Use ABFSS for production and sensitive data.
  • Use DBFS mounts for dev/test environments or temporary storage.
  • Avoid file:// for distributed workloads.
  • Prefer OAuth/Managed Identity over storage keys for security compliance.
  • In Unity Catalog, do not mount data — access it directly via ABFSS or registered external locations.

Example – Reading from DBFS and ABFSS in the Same Notebook

# DBFS
df_dbfs = spark.read.csv("dbfs:/mnt/raw/2025/data.csv")

# ABFSS
df_abfss = spark.read.csv(
    "abfss://raw@mydatalake.dfs.core.windows.net/2025/data.csv"
)
