Azure Databricks provides a powerful platform for data analysis and processing, and connecting it to Azure SQL Database unlocks valuable insights from your relational data. Here’s a step-by-step guide to establish a connection and query your Azure SQL Database from Databricks:
Prerequisites
This exercise uses the following Azure resources:
- Azure Databricks Workspace: Ensure you have an active Databricks workspace.
- Azure SQL Database: You need an existing Azure SQL database with the tables and data you want to query.
- Azure Key Vault: Lets you store secrets, such as database connection strings, securely.
- Network Connectivity: Both Databricks and the SQL database should be accessible within the same virtual network or through appropriate network peering.
Create a Secret Scope
Azure Databricks has Key Vault-backed and Databricks-backed secret scopes. These secret scopes allow users to store secrets, such as database connection strings, securely. If someone tries to output a secret to a notebook, it is replaced by [REDACTED], which helps prevent someone from viewing the secret or accidentally leaking it when displaying or sharing the notebook.
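As a sketch of how this works in practice (the scope name "sql-scope" and key names below are assumptions, not values from this walkthrough), credentials are fetched with `dbutils.secrets.get`, and a small helper illustrates the redaction behavior described above:

```python
# In a Databricks notebook, dbutils is available automatically, so the
# credentials would be fetched like this (scope/key names are assumptions):
#   jdbc_username = dbutils.secrets.get(scope="sql-scope", key="sql-username")
#   jdbc_password = dbutils.secrets.get(scope="sql-scope", key="sql-password")

def redact(text: str, secrets: set) -> str:
    """Illustrate the redaction Databricks applies: any secret value
    printed to a notebook is replaced by [REDACTED]."""
    for secret in secrets:
        text = text.replace(secret, "[REDACTED]")
    return text
```

This is only a local illustration of the behavior; inside Databricks the redaction happens automatically whenever a secret value appears in notebook output.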
Create a Databricks Cluster
1. Set Up an Azure Databricks Cluster:
- Open your Azure Databricks workspace.
- Click Clusters in the left-hand menu.
- Create a new cluster or select an existing one.
Create a Databricks Notebook
Create a notebook to connect to the SQL database: enter a name for the notebook, select Python as the language, and click Create.
Access Azure SQL Database
- Open a notebook in Azure Databricks.
- Choose the default language (e.g., Python or Scala).
Configure JDBC Connection
Load JDBC Driver:
- Databricks Runtime already includes the Microsoft SQL Server JDBC driver, so no extra installation is usually needed. If you need a specific version, attach it to your cluster as a Maven library (e.g., coordinate com.microsoft.sqlserver:mssql-jdbc:9.4.0.jre8) from the cluster's Libraries tab.
Define Connection Details:
- Define the connection details for Azure SQL Database:

```python
jdbcHostname = "your_server_name.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "your_database_name"
jdbcUsername = "your_username"
jdbcPassword = "your_password"

jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase};user={jdbcUsername};password={jdbcPassword}"
```
Replace placeholders with your Azure SQL Database credentials.
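As an alternative sketch (the helper name and the `encrypt` setting are assumptions, not part of the original walkthrough), you can build the URL without embedding credentials and pass the username and password to Spark separately:

```python
def build_jdbc_url(hostname: str, database: str, port: int = 1433) -> str:
    """Build a SQL Server JDBC URL without embedding credentials.

    Azure SQL Database requires encrypted connections, so encrypt=true is
    included; user and password can then be supplied as Spark read options
    rather than appearing in the URL.
    """
    return (
        f"jdbc:sqlserver://{hostname}:{port};"
        f"database={database};encrypt=true;trustServerCertificate=false"
    )

jdbc_url = build_jdbc_url("your_server_name.database.windows.net", "your_database_name")
```

Keeping credentials out of the URL makes it safer to log or display the URL while debugging connectivity.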
Read Data from Azure SQL Database
Read Table Data into a DataFrame:
- Use the spark.read method to read data from the SQL database into a DataFrame:

```python
df = spark.read.format("jdbc").option("url", jdbcUrl).option("dbtable", "your_table_name").load()
```

Replace "your_table_name" with the name of the table you want to query.
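A minimal sketch of packaging the same read as a reusable set of options (the helper name is an assumption); in a notebook these would be passed to `spark.read` via `.options(**...)`:

```python
def jdbc_read_options(url: str, table: str, user: str, password: str) -> dict:
    """Return the option names understood by Spark's JDBC data source."""
    return {
        "url": url,
        "dbtable": table,  # table name (or a parenthesized subquery) to read
        "user": user,
        "password": password,
    }

# In a Databricks notebook:
#   df = (spark.read.format("jdbc")
#         .options(**jdbc_read_options(jdbcUrl, "your_table_name", jdbcUsername, jdbcPassword))
#         .load())
```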
Perform SQL Query:
- To run SQL queries with Spark SQL, first register the DataFrame as a temporary view, then query it:

```python
df.createOrReplaceTempView("your_table_name")
df_filtered = spark.sql("SELECT * FROM your_table_name WHERE condition")
```

Replace "your_table_name" with your view name and add a condition as needed. Note that this query runs in Spark against the loaded DataFrame, not inside Azure SQL Database.
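If you instead want the filtering to run inside Azure SQL Database, Spark's JDBC source also accepts a `query` option in place of `dbtable`, which pushes the SQL down to the database. A sketch (the helper name is an assumption):

```python
def jdbc_query_options(url: str, query: str, user: str, password: str) -> dict:
    """Build options using Spark's "query" option so the SQL executes in
    the database. Note: "query" and "dbtable" are mutually exclusive."""
    return {"url": url, "query": query, "user": user, "password": password}

# In a Databricks notebook:
#   df_filtered = (spark.read.format("jdbc")
#                  .options(**jdbc_query_options(jdbcUrl,
#                                                "SELECT * FROM your_table_name WHERE condition",
#                                                jdbcUsername, jdbcPassword))
#                  .load())
```

Pushing the query down means only the matching rows travel over the network, which can matter for large tables.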
Display and Analyze Data
Display Data:
- Use the .show() method to display the queried data:

```python
df.show()
```

By default, this displays the first 20 rows of the DataFrame.
Perform Data Analysis:
Utilize Spark’s functionalities for data analysis, transformations, aggregations, and visualizations on the DataFrames created.
Example:

```python
# Connection details (replace the placeholders with your own values)
jdbcHostname = "your_server_name.database.windows.net"
jdbcPort = 1433
jdbcDatabase = "your_database_name"
jdbcUsername = "your_username"
jdbcPassword = "your_password"

jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase};user={jdbcUsername};password={jdbcPassword}"

# Read the table into a DataFrame and display the first rows
df = spark.read.format("jdbc").option("url", jdbcUrl).option("dbtable", "your_table_name").load()
df.show()
```
This code establishes a connection, reads data from Azure SQL Database into a DataFrame, and displays the DataFrame content in Databricks. Adapt this process based on your Azure SQL Database setup and the analysis or manipulation you intend to perform with the data.
Additional Tips:
- Store your password in the Key Vault-backed secret scope created earlier and retrieve it with dbutils.secrets.get, rather than hard-coding it in your notebook.
- Utilize Spark DataFrames and SQL functions to manipulate and analyze data retrieved from Azure SQL Database.
- Explore advanced features like Delta tables for optimized performance and ACID transactions when interacting with your SQL database from Databricks.
By following these steps and utilizing the provided examples, you can effectively connect and query Azure SQL Database from within Azure Databricks. This expands your data analysis capabilities and enables you to gain valuable insights from your relational databases alongside other data sources within the Databricks environment.
Remember: Secure your connection details, keep your JDBC driver updated, and explore additional capabilities to maximize the potential of this powerful data integration.