Azure Data Lake Storage Gen2 (ADLS Gen2) is a highly scalable and secure storage solution for big data in Azure. Integrating it with Azure Databricks unlocks powerful data processing capabilities. Here’s a step-by-step guide with examples to read and write data in ADLS Gen2 using Databricks:
Prerequisites:
- Active Azure Databricks workspace: Ensure you have a running Databricks workspace with sufficient resources.
- ADLS Gen2 account: You need an existing ADLS Gen2 account with desired data folders and files.
- Permissions: Your Databricks user or service principal must have read/write permissions on the ADLS Gen2 container/files, for example through the Storage Blob Data Contributor role or equivalent ACLs.
Create a Basic ADLS Gen2 Data Lake and Load in Some Data
The first step is to create the ADLS Gen2 resource in the Azure Portal; it will serve as the data lake for this walkthrough.
Navigate to the Azure Portal, and on the home screen click ‘Create a resource’.
Search for ‘Storage account’, and click on ‘Storage account – blob, file, table, queue’.
Click ‘Create’.
Make sure the proper subscription is selected. If you are on a free trial, this should be the subscription that holds your free credits. Next, select a resource group.
If you do not have an existing resource group to use – click ‘Create new’. A resource group is a logical container to group Azure resources together. Name it something such as ‘intro-databricks-rg’.
Next, pick a storage account name. It must be globally unique and may contain only lowercase letters and numbers (3 to 24 characters), so pick something like ‘adlsgen2demodatalake123’.
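Pick a location near you or use whatever is default. Keep ‘Standard’ performance for now and select ‘StorageV2’ as the ‘Account kind’. For ‘Replication’, select ‘Locally-redundant storage’. Keep the access tier as ‘Hot’. Finally, because ADLS Gen2 is a storage account with the hierarchical namespace turned on, enable ‘Hierarchical namespace’ in the ‘Advanced’ settings before you create the account.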
Set Up Azure Databricks Cluster:
- Open your Azure Databricks workspace.
- Create or select a cluster:
  - Go to the “Clusters” section (labelled “Compute” in newer workspaces).
  - Create a new cluster or use an existing one, and make sure it is running.
Access Azure Data Lake Storage Gen2:
- Open a notebook in Azure Databricks.
- Choose the default language (e.g., Python or Scala). The examples below use Python.
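Mount ADLS Gen2 Storage
- Gather Service Principal Credentials:
  - The mount below authenticates with OAuth, so you need the application (client) ID, a client secret, and the directory (tenant) ID of a Microsoft Entra ID (Azure AD) service principal, plus the storage account and container names. A storage account access key is not used with this configuration.
- Mount the Storage:
  - Mount the ADLS Gen2 storage to access it from Databricks: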
# OAuth (service principal) settings for the ABFS driver
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client_id>",
    "fs.azure.account.oauth2.client.secret": "<client_secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant_id>/oauth2/token",
}
# Mount the ADLS Gen2 account to Databricks
dbutils.fs.mount(
    source="abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount_name>",
    extra_configs=configs,
)
Replace the placeholders (<...>) with your service principal and ADLS Gen2 account details.
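Calling dbutils.fs.mount on a mount point that is already in use raises an error, so notebooks that run more than once usually guard the call. The following is a minimal sketch of that idea, not an official pattern: the helper names (build_oauth_configs, abfss_uri, mount_if_needed) are made up for illustration, and dbutils is passed in as a parameter so the functions can be defined outside a notebook.

```python
from typing import Dict


def build_oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> Dict[str, str]:
    """Build the extra_configs dict for a service-principal (OAuth) mount."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_uri(container: str, storage_account: str) -> str:
    """Build the abfss:// URI for the root of a container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"


def mount_if_needed(dbutils, source: str, mount_point: str, configs: Dict[str, str]) -> bool:
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created, False if it was already there.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True


# In a Databricks notebook (where `dbutils` is predefined), usage might look like:
# configs = build_oauth_configs("<client_id>", "<client_secret>", "<tenant_id>")
# mount_if_needed(dbutils, abfss_uri("<container_name>", "<storage_account_name>"),
#                 "/mnt/<mount_name>", configs)
```

To remove a mount later, dbutils.fs.unmount("/mnt/<mount_name>") can be used.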
Read and Write Data
Read Data from ADLS Gen2:
- Use Databricks to read data:
df = spark.read.csv("/mnt/<mount_name>/<path_to_file>.csv")
Replace <path_to_file> with the path to your CSV file, relative to the mount point.
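By default, spark.read.csv treats the first row as data and reads every column as a string. If your file has a header row, a variant like the sketch below may be more convenient. The function name read_csv_with_header is hypothetical; header and inferSchema are standard options of the PySpark CSV reader.

```python
def read_csv_with_header(spark, path: str):
    """Read a CSV whose first row holds column names, letting Spark infer column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


# In a Databricks notebook (where `spark` is predefined), usage might look like:
# df = read_csv_with_header(spark, "/mnt/<mount_name>/<path_to_file>.csv")
# df.printSchema()
```

Schema inference costs an extra pass over the data, so for large files an explicit schema is often preferable.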
Write Data to ADLS Gen2
- Save a DataFrame to ADLS Gen2:
df.write.mode("overwrite").csv("/mnt/<mount_name>/output_folder")
Replace output_folder with the desired folder name. Note that mode("overwrite") replaces any data already in that folder.
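Spark writes the output as a folder of part files rather than a single CSV, and it omits the header row unless asked to include it. A small sketch of a write that keeps the column names follows; the function name write_csv_with_header is an assumption made for this example.

```python
def write_csv_with_header(df, path: str) -> None:
    """Overwrite `path` with the DataFrame as CSV part files, each with a header row."""
    df.write.mode("overwrite").option("header", True).csv(path)


# In a Databricks notebook, usage might look like:
# write_csv_with_header(df, "/mnt/<mount_name>/output_folder")
```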
Example
# OAuth (service principal) settings for the ABFS driver
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client_id>",
    "fs.azure.account.oauth2.client.secret": "<client_secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant_id>/oauth2/token",
}
# Mount the ADLS Gen2 account to Databricks
dbutils.fs.mount(
    source="abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount_name>",
    extra_configs=configs,
)
# Read data from ADLS Gen2
df = spark.read.csv("/mnt/<mount_name>/<path_to_file>.csv")
df.show()
# Write data to ADLS Gen2
df.write.mode("overwrite").csv("/mnt/<mount_name>/output_folder")
This process allows you to access and manipulate data stored in ADLS Gen2 directly from Azure Databricks. Adapt the paths, file formats, and operations to your own data, and make sure the permissions and security settings on the storage account allow only the access that Databricks needs.
Advanced Techniques
- Utilize Delta Lake format for ACID transactions and optimized performance when writing data.
- Leverage Spark SQL joins, aggregations, and transformations to analyze and manipulate data before writing it to ADLS Gen2.
- Explore Spark MLlib for machine learning and data science tasks on large datasets stored in ADLS Gen2.
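As an illustration of the first two points, the sketch below aggregates a DataFrame and saves the result in Delta format. It is an assumption-laden example rather than a recommended pipeline: the function name write_counts_as_delta, the grouping column, and the output path are all placeholders, and it presumes a cluster where the Delta format is available (as it is on Databricks runtimes).

```python
def write_counts_as_delta(df, group_col: str, path: str):
    """Count rows per value of `group_col` and overwrite `path` with the result as a Delta table."""
    counts = df.groupBy(group_col).count()
    counts.write.format("delta").mode("overwrite").save(path)
    return counts


# In a Databricks notebook, usage might look like:
# counts = write_counts_as_delta(df, "<column_name>", "/mnt/<mount_name>/delta/<table_folder>")
# spark.read.format("delta").load("/mnt/<mount_name>/delta/<table_folder>").show()
```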
Security and Best Practices
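- Store your service principal credentials in a Databricks secret scope (optionally backed by Azure Key Vault) instead of hard-coding them in notebooks, or leverage managed identity for access control.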
- Use appropriate access control lists (ACLs) on ADLS Gen2 containers and files for granular permission control.
- Monitor data access and activity in Databricks and ADLS Gen2 for security auditing.
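For the first point, one common approach is to read the credentials from a Databricks secret scope at run time with dbutils.secrets.get. The sketch below assumes a scope that you have already created and uses made-up key names (sp-client-id, sp-client-secret, sp-tenant-id); substitute whatever names your scope actually contains.

```python
from typing import Dict


def load_service_principal(dbutils, scope: str) -> Dict[str, str]:
    """Fetch service principal credentials from a Databricks secret scope.

    The key names below are hypothetical; use the keys stored in your own scope.
    """
    return {
        "client_id": dbutils.secrets.get(scope=scope, key="sp-client-id"),
        "client_secret": dbutils.secrets.get(scope=scope, key="sp-client-secret"),
        "tenant_id": dbutils.secrets.get(scope=scope, key="sp-tenant-id"),
    }


# In a Databricks notebook (where `dbutils` is predefined), usage might look like:
# creds = load_service_principal(dbutils, "<secret_scope_name>")
# configs["fs.azure.account.oauth2.client.secret"] = creds["client_secret"]
```

Secret values fetched this way are redacted when printed in notebook output, which helps keep them out of logs and shared notebooks.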
Remember:
- Choose the appropriate data formats and Spark functions based on your specific data needs and analysis goals.
- Implement robust security practices and maintain data integrity throughout your data processing workflows.