Creating a notebook in Azure Databricks and querying data in multiple ways using SQL and Python.

Creating a Notebook:

Step-by-Step Guide:

  1. Accessing Azure Databricks:
    • Log in to your Azure portal.
    • Navigate to your Azure Databricks resource.
  2. Creating a Notebook:
    • Click on “Workspace” in the left-hand panel.
    • Select the folder where you want to create the notebook.
    • Click on “Create” and choose “Notebook”.
    • Give your notebook a name and choose the language (Python, Scala, SQL).
    • Click “Create”.

Querying Data in Different Ways:

1. Using SQL:

  1. Accessing Data:
    • In the notebook, select the first cell.
    • Begin the cell with the %sql magic command so its contents run as SQL (needed when SQL is not the notebook's default language).
  2. Querying a Table:
    • SQL queries can access data from existing tables.
    • Example:
%sql
SELECT * FROM my_table
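SQL cells are not limited to SELECT *; aggregations, filters, and limits work as usual. A short sketch, assuming my_table has hypothetical columns category and amount:

```sql
%sql
-- Count rows and sum amounts per category, largest groups first
SELECT category,
       COUNT(*)    AS row_count,
       SUM(amount) AS total_amount
FROM my_table
GROUP BY category
ORDER BY row_count DESC
LIMIT 10
```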

2. Using Python:

  1. Accessing Data:
    • In the next cell, switch to Python — either by changing the cell language or by starting the cell with the %python magic command (if Python is not the notebook's default language).
  2. Querying Data with PySpark:
    • PySpark DataFrames help query and manipulate data.
    • Example:
# Read data into a DataFrame; the header option tells Spark that
# the first row of the CSV contains column names
df = spark.read.format("csv").option("header", "true").load("/FileStore/your_data_path.csv")

# Show the first few rows
df.show()

# Perform data manipulations or filtering
df_filtered = df.filter(df['column_name'] == 'value')

# Display the filtered data
df_filtered.show()
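As an aside, PySpark's filter syntax closely mirrors pandas (covered later in this guide). For comparison, here is the same filter in plain pandas, with made-up sample data so the snippet is self-contained:

```python
import pandas as pd

# Made-up sample data standing in for the CSV loaded above
df = pd.DataFrame({
    'column_name': ['value', 'other', 'value'],
    'score': [1, 2, 3],
})

# Equivalent of df.filter(df['column_name'] == 'value') in PySpark
df_filtered = df[df['column_name'] == 'value']
print(df_filtered)
```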

Visualization:

  • Visualizations can be created using libraries like Matplotlib or Pandas.
  • Example:
import matplotlib.pyplot as plt

# Convert the Spark DataFrame to pandas before plotting;
# collect() returns Row objects, which plt.bar cannot use directly
pdf = df_filtered.toPandas()

# Create a bar chart
plt.bar(pdf['column_name'], pdf['value'])
plt.xlabel('Column')
plt.ylabel('Value')
plt.title('Visualization')
plt.show()

Running the Notebook:

  • Click on “Run All” or execute cells individually to see results.

Explanation:

  • Notebook Creation: Creating a notebook in Azure Databricks provides a workspace for coding in different languages (Python, Scala, SQL).
  • SQL Queries: The %sql command allows direct execution of SQL queries against registered tables in the workspace, providing easy access to data.
  • Python & PySpark: Python cells allow you to use PySpark DataFrames for more complex data manipulation and analysis.
  • Data Visualization: Visualizations are generated using Python libraries like Matplotlib, enhancing data interpretation.
  • Execution: Executing cells runs the code, displaying outputs like tables, query results, or visualizations in the notebook interface.

Using Matplotlib and Pandas for Visualizations

Using Matplotlib for Visualizations in Python:

Importing Matplotlib:

  • In a Python cell of your Azure Databricks notebook, import the Matplotlib library:
import matplotlib.pyplot as plt

Data Preparation:

  • Prepare your data or use a DataFrame obtained from a previous query or analysis:
# Example DataFrame
data = {'Category': ['A', 'B', 'C', 'D'],
        'Values': [10, 20, 15, 25]}

Creating a Plot:

  • Use Matplotlib functions to create a plot (e.g., bar chart, line plot):
# Bar chart
plt.bar(data['Category'], data['Values'])
plt.xlabel('Category')
plt.ylabel('Values')
plt.title('Bar Chart Example')

Displaying the Plot:

  • Use plt.show() to display the plot in the notebook:
plt.show()
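Put together, the four steps above fit into a single runnable cell; a minimal sketch using the sample data from the preparation step:

```python
import matplotlib.pyplot as plt

# Example data from the preparation step
data = {'Category': ['A', 'B', 'C', 'D'],
        'Values': [10, 20, 15, 25]}

# plt.bar returns the bar artists, which can be inspected or styled further
bars = plt.bar(data['Category'], data['Values'])
plt.xlabel('Category')
plt.ylabel('Values')
plt.title('Bar Chart Example')
plt.show()
```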

Note about Matplotlib

Matplotlib is a comprehensive library in Python used for creating static, interactive, and publication-quality visualizations. It’s a powerful tool for data visualization, providing a wide array of plots and charts to represent data in various formats.

Key Features of Matplotlib:

  1. Versatile Plots: Matplotlib allows the creation of a wide range of plots, including:
    • Line plots
    • Bar charts
    • Scatter plots
    • Histograms
    • Pie charts
    • Box plots
    • Heatmaps, and more.
  2. Customization: It offers extensive customization options for fine-tuning plots, including colors, markers, line styles, axes, labels, titles, legends, and annotations.
  3. Support for Different Formats: Matplotlib supports multiple file formats for saving plots, such as PNG, PDF, SVG, and more, ensuring compatibility for various purposes.
  4. Integration with Jupyter Notebooks: Matplotlib seamlessly integrates with Jupyter Notebooks, making it ideal for interactive data analysis and visualization.
  5. Backend Support: It provides different backend options, allowing users to display plots in various environments, including web applications, GUI toolkits, and more.
  6. Rich Ecosystem: Matplotlib’s ecosystem includes tools like Seaborn, Pandas, and NumPy, complementing its functionality for statistical visualization and data analysis.

Example of Matplotlib:

Here’s an example of creating a simple bar chart using Matplotlib:

import matplotlib.pyplot as plt

# Data
categories = ['Category A', 'Category B', 'Category C']
values = [20, 35, 30]

# Creating a bar chart
plt.bar(categories, values)
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Simple Bar Chart')

# Displaying the plot
plt.show()
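To exercise the file-format support mentioned above, the same chart can be written to disk with savefig instead of (or in addition to) being shown; the filename here is arbitrary:

```python
import matplotlib.pyplot as plt

# Same data as the bar chart example
categories = ['Category A', 'Category B', 'Category C']
values = [20, 35, 30]

plt.bar(categories, values)
plt.title('Simple Bar Chart')

# The output format is inferred from the extension (.png, .pdf, .svg, ...)
plt.savefig('simple_bar_chart.png', dpi=150, bbox_inches='tight')
```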

Use Cases:

  • Exploratory data analysis
  • Presenting insights to stakeholders
  • Academic research
  • Scientific data visualization
  • Creating publication-quality figures for reports or publications.

Using Pandas for Visualizations with DataFrames:

Importing Pandas:

  • Import Pandas for DataFrame manipulation and visualization:
import pandas as pd

Data Preparation:

  • Use a DataFrame or prepare your data for visualization:
# Example DataFrame creation
df = pd.DataFrame(data, columns=['Category', 'Values'])

Creating a Plot using Pandas:

  • Utilize Pandas’ built-in plotting functions:
# Bar chart
df.plot(x='Category', y='Values', kind='bar', title='Bar Chart Example')

Displaying the Plot:

    • The plot is rendered when the cell is executed; Pandas delegates to Matplotlib under the hood, so plt.show() may also be needed outside notebook environments.
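Under the hood, df.plot returns a Matplotlib Axes object that can be customized further. A self-contained sketch, reusing the sample data dict from the Matplotlib section:

```python
import pandas as pd

data = {'Category': ['A', 'B', 'C', 'D'],
        'Values': [10, 20, 15, 25]}
df = pd.DataFrame(data)

# Pandas delegates plotting to Matplotlib and returns the Axes
ax = df.plot(x='Category', y='Values', kind='bar', title='Bar Chart Example')
ax.set_ylabel('Values')
```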

Note about Pandas

Pandas is a popular open-source Python library used for data manipulation and analysis. It provides powerful data structures and tools for cleaning, transforming, and analyzing structured data. Pandas is widely used in data science, machine learning, and research due to its simplicity and efficiency in handling tabular, time-series, and heterogeneous data.

Key Features of Pandas:

  1. DataFrame & Series: Pandas introduces two main data structures:
    • DataFrame: A 2-dimensional labeled data structure resembling a table with rows and columns, similar to a spreadsheet or SQL table.
    • Series: A one-dimensional labeled array capable of holding various data types.
  2. Data Manipulation & Cleaning: Pandas offers a plethora of functions and methods for:
    • Reading and writing data from various file formats (CSV, Excel, SQL, JSON, etc.).
    • Handling missing data and performing data imputation.
    • Filtering, sorting, and transforming data.
    • Combining and merging datasets.
  3. Descriptive Statistics & Aggregation: Pandas allows for quick computation of descriptive statistics (mean, median, standard deviation, etc.) and aggregations over data frames.
  4. Time Series Analysis: Pandas provides specialized tools for working with time series data, including date range generation, shifting, and frequency conversion.
  5. Visualization: While not a primary visualization library, Pandas integrates with Matplotlib to create simple plots directly from DataFrames.
  6. Integration with NumPy: Pandas is built on top of NumPy, utilizing its powerful array-based computation, making it efficient for handling large datasets.

Example of Pandas:

Here’s an example showcasing basic operations with Pandas:

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)

# Basic statistics
print(df.describe())

# Filtering data
filtered_df = df[df['Age'] > 25]
print(filtered_df)
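The missing-data handling and aggregation features listed earlier can be sketched as follows; the column names and values are made up for illustration:

```python
import pandas as pd
import numpy as np

# A small frame with one missing value
df = pd.DataFrame({'city': ['NY', 'NY', 'SF', 'SF'],
                   'sales': [100.0, np.nan, 80.0, 120.0]})

# Fill the missing value with the column mean, then aggregate per city
df['sales'] = df['sales'].fillna(df['sales'].mean())
totals = df.groupby('city')['sales'].sum()
print(totals)
```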

Use Cases:

  • Data cleaning and preprocessing
  • Exploratory data analysis (EDA)
  • Time series analysis
  • Statistical analysis and modeling
  • Data preparation for machine learning tasks.

Explanation:

  • Library Import: Import Matplotlib or Pandas at the beginning of the cell to access their functionalities for visualization.
  • Data Preparation: Prepare data in the desired format, either from previous analyses or creating new DataFrames for visualization.
  • Creating a Plot: Use Matplotlib functions or Pandas’ built-in plotting methods to create various types of visualizations like bar charts, line plots, scatter plots, etc.
  • Displaying the Plot: The plt.show() function for Matplotlib or executing the Pandas plot command displays the visualization within the notebook interface.

Building a Simple Chart

Finally, let's create a simple chart from this data set. First, re-run the SELECT * query from earlier.

Next, click the bar chart icon in the bottom left-hand corner of the results pane.

Then, click ‘Plot Options’.

Select ‘Bar chart’ as the display type, move the ‘province_state’ field into the Keys field and the ‘recovered’ field into Values, and click ‘Apply’.

Notice that the graph is hard to read and probably contains more information than we want. Let's modify our query to look at the top five states with the most recoveries. There are also some extra rows we don't want, such as regions outside the US and one row that aggregates all recoveries for a region; we can filter those out in the WHERE clause. Once we modify the query and re-run it, the chart refreshes automatically.
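The modified query might look like the sketch below; the table name covid_data, the column country_region, and the literal ‘Recovered’ are assumptions that depend on the actual data set:

```sql
%sql
SELECT province_state, SUM(recovered) AS total_recovered
FROM covid_data                       -- hypothetical table name
WHERE country_region = 'US'           -- assumed column: drops non-US regions
  AND province_state <> 'Recovered'   -- assumed label of the aggregate row
GROUP BY province_state
ORDER BY total_recovered DESC
LIMIT 5
```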