
How to mount cloud object storage (ADLS) on Azure Databricks


Think of your workspace as a locker for your data. Azure Databricks acts like a special key that connects this locker to a cloud storage space, so you can easily access the things you keep in the cloud without knowing much about how the cloud works.

Now, when you connect your locker to the cloud using Azure Databricks, you create something called a mount. But here’s the catch: this special key doesn’t work well with another system called Unity Catalog. Unity Catalog helps organize and manage your data.

So, instead of using this special key (mount) to connect your locker to the cloud, Azure Databricks suggests using Unity Catalog to keep things organized. It’s like saying, “Hey, let’s stop using this old key and use this new, better system for managing our stuff.”

How does Azure Databricks mount cloud object storage?

  1. What Azure Databricks mounts do:
    • They create a connection between a workspace (where you work on data) and storage in the cloud (where data is stored).
    • This connection lets you use the cloud storage like it’s a part of your workspace, using paths that you’re already familiar with.
  2. How mounts work:
    • They create a local alias of the cloud storage in a special directory called /mnt (no data is actually copied).
    • This alias stores three main things:
      • Where the cloud storage is located.
      • Specifications on how to connect to the storage (like which technology to use).
      • Credentials (like passwords) needed to access the data securely.

So, in simple terms, Azure Databricks mounts help you easily use cloud storage as if it’s just another part of your workspace, by creating a special link and storing important information about the connection in a specific directory.

Syntax for mounting storage

The source specifies the URI of the object storage (and can optionally encode security credentials). The mount_point specifies the local path in the /mnt directory. Some object storage sources support an optional encryption_type argument. For some access patterns you can pass additional configuration specifications as a dictionary to extra_configs.

Databricks recommends setting mount-specific Spark and Hadoop configuration as options using extra_configs. This ensures that configurations are tied to the mount rather than the cluster or session.

dbutils.fs.mount(
  source: str,
  mount_point: str,
  encryption_type: Optional[str] = "",
  extra_configs: Optional[dict[str, str]] = None
)

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")
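
If a script may run more than once, the path might already be unmounted and dbutils.fs.unmount() will raise an error. Here is a minimal defensive sketch (the mount name is a placeholder):

# Unmount only if the mount point actually exists (placeholder name).
mount_point = "/mnt/<mount-name>"
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)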

How to Mount ADLS Gen2 or Blob Storage with ABFS

You can mount data in an Azure storage account using a Microsoft Entra ID (formerly Azure Active Directory) application service principal for authentication. Keep the following in mind:

  1. All users in the Azure Databricks workspace have access to the mounted ADLS Gen2 account.
  2. The service principal used to access ADLS Gen2 should be granted access only to that storage account, not to other Azure resources.
  3. Creating a mount point through a cluster makes it immediately available to users of that cluster; to use it in another running cluster, run dbutils.fs.refreshMounts() there (see the short sketch after this list).
  4. Unmounting a mount point while jobs are running can cause errors; avoid unmounting as part of production workloads.
  5. Mount points that use secrets are not refreshed automatically; rotating a secret can cause errors such as “401 Unauthorized”, which require unmounting and remounting.
  6. To mount an Azure Data Lake Storage Gen2 account using the ABFS endpoint, the storage account must have Hierarchical Namespace (HNS) enabled.
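
As a quick illustration of point 3, here is a minimal sketch to run on the other, already-running cluster (the mount name is a placeholder):

# Make mounts created from another cluster visible on this cluster.
dbutils.fs.refreshMounts()

# Verify that the mount is now reachable (placeholder mount name).
display(dbutils.fs.ls("/mnt/<mount-name>"))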

Run the following in your notebook to authenticate and create a mount point.

For Python:
configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": "<application-id>",
          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
For Scala:

val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

Replace:

• <application-id> with the Application (client) ID for the Azure Active Directory application.
• <scope-name> with the Databricks secret scope name.
• <service-credential-key-name> with the name of the key containing the client secret.
• <directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
• <container-name> with the name of a container in the ADLS Gen2 storage account.
• <storage-account-name> with the ADLS Gen2 storage account name.
• <mount-name> with the name of the intended mount point in DBFS.
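
After the mount command succeeds, a quick sanity check confirms that the mount works; this is a minimal sketch using the same placeholder mount name:

# List the contents of the newly mounted container (placeholder mount name).
display(dbutils.fs.ls("/mnt/<mount-name>"))

# The new entry should also appear in the mount table.
display(dbutils.fs.mounts())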

How Do I Access My Data Stored in Cloud Object Storage Using Mount Points?

Once mounted, accessing your data (e.g., a Delta table) is as straightforward as referencing the mount point in your data operations:

# Using Spark, read the Delta table by its path under the mount point
df = spark.read.format("delta").load("/mnt/my_mount_point/my_data")

# Using Spark, write a Delta table back to the mount point
df.write.format("delta").mode("overwrite").save("/mnt/my_mount_point/delta_table")
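
Because the mount behaves like any other DBFS path, other formats and commands work the same way. Here is a small sketch with hypothetical file and folder names under the mount:

# Read a CSV file that lives under the mount point (hypothetical path).
csv_df = (spark.read
          .option("header", "true")
          .csv("/mnt/my_mount_point/raw/events.csv"))

# Query the Delta table written above via a temporary view.
spark.read.format("delta").load("/mnt/my_mount_point/delta_table") \
    .createOrReplaceTempView("my_delta_view")
spark.sql("SELECT COUNT(*) AS row_count FROM my_delta_view").show()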

Why and When Do You Need Mount Points?

Using mount points was the general practice for accessing cloud object storage before Unity Catalog was introduced. Mount points still make sense when:

• You want to access your cloud object storage as if it were part of DBFS.
• Unity Catalog is not enabled in your workspace.
• Your cluster runs a Databricks Runtime (DBR) version older than 11.3 LTS.
• You have no access to a premium workspace plan (i.e., you are on the Standard plan).
• If you want to avoid mount points but still cannot use Unity Catalog (UC), you can set your service principal (SP) credentials in the Spark configuration and access the ADLS Gen2 containers that way (see the sketch after this list).
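
For the last point, here is a minimal sketch of setting service principal credentials in the session’s Spark configuration and reading through the abfss:// URI directly, using the same placeholders as in the mounting example:

# Session-scoped service principal authentication for ABFS (no mount point).
storage_account = "<storage-account-name>"
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")

# Access the container directly through its abfss:// URI.
df = spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/my_data")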

When Should You Use Unity Catalog Instead of Mount Points?

• The above conditions don’t apply to you.
• You can use a cluster with a later DBR version (11.3 LTS or above) and have access to a premium plan.
• Mounted data doesn’t work with Unity Catalog.
  – However, you can still see your tables and their referenced mount point paths in the old hive_metastore catalog if you migrated to UC (see the sketch after this list).
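
To make the last point concrete, here is a small sketch with hypothetical table and catalog names, reading the same data once through the legacy hive_metastore catalog (whose table location points at a /mnt/ path) and once through a Unity Catalog table:

# Legacy table registered in hive_metastore; its location is a /mnt/ path.
legacy_df = spark.table("hive_metastore.default.my_table")

# Equivalent table governed by Unity Catalog (hypothetical catalog/schema names).
uc_df = spark.table("main.default.my_table")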

How to list all the mount points in Azure Databricks?

display(dbutils.fs.mounts())

You can simply use the Databricks filesystem commands to navigate through the mount points available in your cluster.

%fs
mounts

Both commands return the same list of mount points available in your workspace.
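
If you have many mounts, you can also filter the mount table in Python. Here is a minimal sketch that keeps only mounts under /mnt/ and prints where each one points:

# Show every mount point under /mnt/ together with its source URI.
for m in dbutils.fs.mounts():
    if m.mountPoint.startswith("/mnt/"):
        print(f"{m.mountPoint} -> {m.source}")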

Best Practices for Using Mount Points

• When performing mount operations, manage your secrets using secret scopes and never expose raw secrets.
• Keep your mount points up to date.
  – If a source no longer exists in the storage account, remove the corresponding mount point from Databricks as well.
• Using the same mount point name as your container name can make things easier if you have many mount points; especially if you come back to your workspace after some time, you can easily match them with Azure Storage Explorer.
• Don’t put non-mount-point folders or other files in the /mnt/ directory; they will only cause confusion.
• If your SP credentials get updated, you might have to remount all your mount points (see the sketch after this list):
  – You can loop through the mount points if they are all still pointing to existing sources.
  – Otherwise, you will get AAD exceptions and have to manually unmount and remount each mount point.
• If you can, use Unity Catalog (UC) instead of mount points for better data governance, centralized metadata management, fine-grained security controls, and a unified data catalog across different Databricks workspaces.
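
As mentioned above, when the service principal credentials are rotated you can remount everything in one loop, provided all sources still exist. Here is a minimal sketch, assuming configs is the dictionary with the refreshed OAuth settings from the mounting example:

# Remount every ADLS Gen2 mount under /mnt/ with refreshed credentials.
# Assumes `configs` holds the updated OAuth settings shown earlier.
for m in dbutils.fs.mounts():
    if m.mountPoint.startswith("/mnt/") and m.source.startswith("abfss://"):
        dbutils.fs.unmount(m.mountPoint)
        dbutils.fs.mount(source=m.source,
                         mount_point=m.mountPoint,
                         extra_configs=configs)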