Azure Databricks Tutorial: How to work with Azure Databricks – Part 3

Working with Azure Databricks involves various tasks related to data engineering, data analysis, and machine learning. Below, I’ll provide a step-by-step guide with examples for common tasks in Azure Databricks. Please note that you need to have an Azure Databricks workspace set up, as described in the previous part of this series, before proceeding with these steps.

Step 1: Access Azure Databricks Workspace

  • Log in to the Azure Portal.
  • Navigate to your Azure Databricks workspace.
  • Click on your workspace to access it.

The landing page of the Databricks Workspace should look like this:

On the left-hand side of the page you should see the navigation pane. It is pinned and remains accessible no matter where you are in the workspace.

  • Home: Takes you to your own personal folder within the workspace, where you can store your own notebooks.
  • Workspace: Brings up a ‘file system’ where you can organize your notebooks into logical groupings. By default, there is a ‘Shared’ folder and a ‘Users’ folder. The ‘Shared’ folder is where collaborative notebooks can be stored, and the ‘Users’ folder will have a folder for each user who has access to the workspace. You can set permissions on these directories to control who has access to what locations, for cases where there is sensitive information you need to secure.
  • Recents: Brings up a list of recently accessed notebooks that you have been working on.
  • Data: Shows data you have declared within Databricks (databases and tables). You will need a running cluster to see any data here.
  • Clusters: The page where you can create, modify, and maintain Spark clusters in a simple GUI.
  • Jobs: The place where you can see all configured jobs and job runs. A Job is a notebook set to run based on a trigger (via a REST call, on a schedule, or invoked via Azure Data Factory, just to name a few).
  • Models: Using MLflow, you can manage deployed machine learning models through this interface.
  • Search: A search module for your workspace.

Step 2: Create a Databricks Cluster
A Databricks cluster is a group of virtual machines where your Spark jobs run.

On the main screen of the workspace, you will see three columns. These are quick links to jump right into coding:

The first column has a list of common tasks so that you can jump right into whatever you would like to work on – whether that is creating a new notebook, new cluster, new job, etc. There is also a QuickStart tutorial that will help you create a cluster and query some pre-loaded data.

The middle column has a quick link to load data, and below that displays recent notebooks you have worked on.

The last column has a quick link to create a new notebook as well as links to the key Databricks documentation.

  • In the Azure Databricks workspace, click on the “Clusters” option in the left sidebar.
  • Click the “Create Cluster” button.
  • Configure your cluster settings, including the cluster name, type, and the number of worker nodes.
  • Click “Create Cluster.”

Creating a Simple Cluster

The next step is to create a simple cluster, which will act as the ‘compute’ for any of the code we write. This is the power of modern cloud computing with massively parallel processing (MPP) systems: the separation of compute and storage. You can scale each independently, with no bearing on the other, allowing you to scale much more quickly and efficiently.

The beauty of creating clusters in Databricks is that it is as easy as filling out a few key fields and clicking ‘Create’. The Spark cluster is built and configured on Azure VMs in the background and is nearly infinitely scalable if you need more power. Setting up your own custom Spark cluster is difficult and tedious at best. Databricks abstracts this away and manages all of the dependencies, updates, and backend configurations so that you can focus on coding.

On the home page of your workspace, click the ‘New Cluster’ link to get started.

Exploring the options for creating clusters could be a topic unto itself, so for now let’s fill out the options with some simple configurations, as seen below:

Once you have filled everything out, click ‘Create Cluster’ in the top right-hand corner, and Databricks will take care of the rest!

The creation process takes around 5-10 minutes, and once complete, you should see the following screen, with a green dot next to your cluster, indicating that it is active.

To terminate, start, and manage your cluster, you can click on it, which brings you into the cluster management screen as follows:

Remember, we set the ‘Terminate after’ field to 20 minutes, so if the cluster is idle for 20 minutes, it will automatically shut down. This saves money, because you are paying for every minute the cluster is running!
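
If you would rather script cluster creation than click through the GUI, the Databricks Clusters REST API can create the same cluster. Below is a minimal sketch in Python using the requests library; the workspace URL, personal access token, Spark runtime version, and VM node type are placeholders you would replace with values from your own workspace.

import requests

# Placeholder values - replace with your own workspace URL and a personal access token
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# Cluster definition mirroring the simple GUI configuration above,
# including the 20-minute auto-termination setting
cluster_config = {
    "cluster_name": "simple-demo-cluster",
    "spark_version": "13.3.x-scala2.12",  # pick a runtime version listed in your workspace
    "node_type_id": "Standard_DS3_v2",    # an Azure VM size available in your region
    "num_workers": 2,
    "autotermination_minutes": 20,
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_config,
)
print(response.json())  # returns the new cluster_id on success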

Loading a CSV of Sample Data

The next step is to load some sample data.

Download a sample CSV file (any small dataset will do), then navigate back to your home screen in Databricks using the Azure Databricks icon in the top left-hand corner of the screen:

From here, click ‘click to browse’ to instantly upload some data. There are many ways to import data into Databricks, but for now we are going to use the GUI approach.

Navigate to the CSV and click ‘open’.

This will bring you to a ‘Create new table’ page. We will be uploading the data to ‘DBFS’, which stands for Databricks File System. DBFS is auto-configured storage backed by Azure Blob (binary large object storage) for storing basic data for access in Databricks. The data source should already be filled out for you, and you can specify your own folder path in DBFS if you want. For now, just click ‘Create Table with UI’.

You will be prompted to pick a cluster to preview the table. Select the cluster you just created and click ‘Preview Table’.

This brings up a preview of the data you are about to load, with options to specify table attributes on the left-hand side. Fill out the fields as shown here. You will notice that we now have the proper header column and data types. Once you are happy with how the table looks, click ‘Create Table’.

This should bring you to a page showing the schema of the new table and some sample data. You can now also see the table by navigating to the ‘Data’ tab in the navigation pane on the left. You will see that the data was added to our default database:
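
Once the table exists, you can query it from a notebook right away. Here is a quick sketch, assuming the table landed in the default database under the (hypothetical) name sample_data; swap in the name you chose when creating the table.

# List the files uploaded to DBFS (dbutils is available in every Databricks notebook)
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))

# Query the new table with Spark SQL
sample_df = spark.sql("SELECT * FROM default.sample_data LIMIT 10")
display(sample_df)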

Step 3: Create a Notebook

  • In the Azure Databricks workspace, click on the “Workspace” option in the left sidebar.
  • Under the “Workspace” tab, click on the folder where you want to create your notebook.
  • Click the “Create” button, then select “Notebook.”
  • Give your notebook a name and choose the default language (e.g., Python, Scala, R, SQL).
  • Click “Create” to create the notebook.

Step 4: Write Spark Code in a Notebook
Let’s write a simple Spark code example in a Python notebook. Note that Databricks notebooks already provide a SparkSession as the variable spark, so the builder call below simply reuses the existing session:
# Import the SparkSession module
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create a DataFrame
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Show the DataFrame
df.show()


Step 5: Run Code in a Notebook

  • In your notebook, write or paste the Spark code.
  • Click the “Run” icon in the cell (or press Shift+Enter) to execute the code. The output will be displayed below the code cell.

Step 6: Data Exploration and Transformation
Let’s perform some data exploration and transformation on the DataFrame:
# Display the schema of the DataFrame
df.printSchema()

# Perform a SQL-like query
df.select("Name", "Age").filter(df.Age > 30).show()

# Group by age and count
df.groupBy("Age").count().show()
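
The same “SQL-like” operations can also be written as real SQL by registering the DataFrame as a temporary view, which is handy when mixing Python and SQL in one notebook. A small sketch:

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")

# Equivalent of the filter above, expressed in Spark SQL
spark.sql("SELECT Name, Age FROM people WHERE Age > 30").show()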


Step 7: Visualize Data
You can create visualizations using libraries like Matplotlib and Seaborn, or use the built-in Databricks visualizations. Here’s an example using Matplotlib:
import matplotlib.pyplot as plt

# Create a bar chart of ages
age_counts = df.groupBy("Age").count().orderBy("Age")
age_counts_pd = age_counts.toPandas()

plt.bar(age_counts_pd["Age"], age_counts_pd["count"])
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Distribution")
plt.show()
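
Alternatively, Databricks notebooks ship with a built-in display() function that renders interactive charts directly from a Spark DataFrame, without converting to pandas first. For example:

# display() renders an interactive table; use the chart controls below the output
# to switch the result to a bar chart of count by Age
display(df.groupBy("Age").count().orderBy("Age"))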


Step 8: Data Ingestion and Storage

You can read and write data from various storage sources like Azure Data Lake Storage, Azure Blob Storage, and more. Here’s an example of reading data from a CSV file:
# Read data from a CSV file
csv_file_path = "dbfs:/FileStore/tables/sample_data.csv"
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the DataFrame
df.show()
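
Writing data works the same way through the DataFrame API. Here is a brief sketch that saves the DataFrame back out, once as Parquet files in DBFS and once as a managed table; the output path and table name are illustrative.

# Write the DataFrame to DBFS as Parquet files (path is illustrative)
df.write.mode("overwrite").parquet("dbfs:/FileStore/tables/sample_data_parquet")

# Or save it as a managed table in the default database
df.write.mode("overwrite").saveAsTable("default.sample_data_copy")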


Step 9: Machine Learning
Azure Databricks supports machine learning with MLlib and other libraries. Here’s a simple example of linear regression. It assumes a DataFrame df with numeric feature columns named Feature1 and Feature2 and a numeric Label column (rather than the Name/Age DataFrame from earlier):
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Prepare the data
assembler = VectorAssembler(inputCols=["Feature1", "Feature2"], outputCol="features")
df = assembler.transform(df)

# Create a linear regression model
lr = LinearRegression(featuresCol="features", labelCol="Label")
model = lr.fit(df)

# Make predictions
predictions = model.transform(df)
predictions.show()
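
To get a sense of how well the model fits, you can score the predictions with MLlib’s RegressionEvaluator. A minimal sketch (in practice you would split the data into training and test sets first):

from pyspark.ml.evaluation import RegressionEvaluator

# Compute the root mean squared error of the predictions against the Label column
evaluator = RegressionEvaluator(labelCol="Label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print(f"RMSE: {rmse}")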


Step 10: Saving and Sharing Notebooks

  • Databricks saves notebooks automatically as you work; you can also use the “File” menu to clone, rename, or export a notebook.
  • You can share your notebook by clicking the “Share” button and granting permissions to collaborators.

These steps cover some common tasks in Azure Databricks, but the platform offers a wide range of capabilities for big data processing, analytics, and machine learning. You can explore more advanced topics like job scheduling, integration with other Azure services, and building end-to-end data pipelines as you become more familiar with the platform.
