The following prerequisites are required to get started with an Apache Spark pool:

  • An Azure subscription. If needed, create a free Azure account.
  • A Synapse Analytics workspace.
  • A serverless Apache Spark pool.

Steps:

Sign in to the Azure portal

Create a notebook

Create a simple Spark DataFrame object to manipulate. In this case, you create it from code; it has three rows and three columns:

new_rows = [('CA', 22, 45000), ('WA', 35, 65000), ('WA', 50, 85000)]
demo_df = spark.createDataFrame(new_rows, ['state', 'age', 'salary'])
demo_df.show()

Enter the code below in another cell and run it. This creates a Spark table, a CSV file, and a Parquet file, each holding a copy of the data:

demo_df.createOrReplaceTempView('demo_df')
demo_df.write.csv('demo_df', mode='overwrite')
demo_df.write.parquet('abfss://<<TheNameOfAStorageAccountFileSystem>>@<<TheNameOfAStorageAccount>>.dfs.core.windows.net/demodata/demo_df', mode='overwrite')

Run Spark SQL statements

The following command lists the tables on the pool.

%%sql
SHOW TABLES

Run another query to see the data in demo_df.

%%sql
SELECT * FROM demo_df

It is possible to get the same experience of running SQL without switching languages. To do this, replace the SQL cell above with the following PySpark cell; the output experience is the same because the display command is used:

display(spark.sql('SELECT * FROM demo_df'))

Create a new serverless Apache Spark pool using the Azure portal

The following prerequisite is needed:

  • A Synapse Analytics workspace

Navigate to the Synapse workspace


Create new Apache Spark pool

Screenshot: Synapse workspace Overview page, highlighting the command to create a new Apache Spark pool.

Create a serverless Apache Spark pool using Synapse Studio

Navigate to the Synapse workspace

Screenshot: Azure portal search bar with "Synapse workspaces" typed in.

Launch Synapse Studio

Create the Apache Spark pool in Synapse Studio

Create an Apache Spark GPU-enabled Pool in Azure Synapse Analytics using the Azure portal

Create new Azure Synapse GPU-enabled pool

How to Analyze Data with Apache Spark

Here’s a step-by-step guide on how to analyze data with Apache Spark, using examples to make it easier to understand:

1. Import Necessary Libraries:

  • Begin by importing the SparkSession and other relevant libraries like SQL functions and data types.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

2. Create a SparkSession:

  • This creates a Spark application and connects to the Spark cluster.
spark = SparkSession.builder.appName("MySparkAnalysis").getOrCreate()

3. Load Data:

  • Read data from various sources like CSV, Parquet, JSON, databases, or other supported formats.
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)  # treat the first row as a header and infer column types

4. Explore Data:

  • Get a feel for the data’s structure, content, and basic statistics.
df.show()  # Display a few rows
df.printSchema()  # View schema
df.describe().show()  # Summary statistics

5. Clean and Transform Data:

  • Handle missing values, correct errors, filter data, create new columns, and format data as needed.
df = df.dropna()  # Remove rows with missing values
df = df.withColumn("new_column", col("column1") + col("column2"))  # Create a new column

6. Analyze Data:

  • Apply various analytical operations using Spark’s rich API.
  • Aggregations:
df.groupBy("category").agg(sum("sales")).show()
  • Joins:
df1.join(df2, on="customer_id").show()
  • Machine learning:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# ... (Create features, train model, make predictions)

7. Visualize Results:

  • Convert results to pandas and use external libraries like Matplotlib or Seaborn, or use the notebook's built-in chart view via the display command.
df.toPandas().plot(kind="bar")  # convert to pandas, then draw a simple bar chart

8. Save Results:

  • Store results in various formats (CSV, Parquet, databases, etc.) for further use or sharing.
df.write.parquet("output_data.parquet")

9. Stop SparkSession:

  • Gracefully terminate the Spark application.
spark.stop()

How to Analyze Data with Apache Spark, with Examples

1. Gathering Tools:

  • Import Libraries: Like a chef gathering their tools and ingredients, you start by importing the necessary Spark libraries to work with data.
from pyspark.sql import SparkSession  # The main Spark tool
from pyspark.sql.functions import *  # Extra tools for data manipulation

2. Setting Up the Kitchen:

  • Create SparkSession: This creates a workspace, like a kitchen, where you’ll prepare and analyze your data.
spark = SparkSession.builder.appName("MySparkAnalysis").getOrCreate()

3. Bringing in the Ingredients:

  • Load Data: Fetch your data, whether it’s in a CSV file, a database, or other formats, and bring it into Spark for cooking.
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)  # treat the first row as a header and infer column types

4. Checking the Freshness:

  • Explore Data: Take a look at the data’s structure, like checking your ingredients for freshness. Find out what’s inside and get a general idea of its contents.
df.show()  # Preview a few rows
df.printSchema()  # See the data's "recipe" (structure)

5. Prepping and Cleaning:

  • Clean and Transform Data: Prepare the data for analysis, like washing vegetables or chopping ingredients. Handle missing values, correct errors, and shape the data into a usable form.
df = df.dropna()  # Toss out rows with missing values (like rotten bits)

6. Cooking and Mixing:

  • Analyze Data: This is where the magic happens! Use Spark’s powerful tools to:
    • Find patterns and trends (like finding the perfect spice blend).
    • Calculate summaries and aggregates (like measuring total ingredients).
    • Join different datasets together (like combining flavors).
    • Build machine learning models (like creating new recipes).

7. Plating and Presentation:

  • Visualize Results: Create charts and graphs to visualize the insights you’ve uncovered, making them easy to understand and share, like plating a delicious meal.
df.toPandas().plot(kind="bar")  # convert to pandas, then plot, like arranging food on a plate

8. Saving Leftovers:

  • Save Results: Store your findings for later use or sharing, like storing leftovers in the fridge.
df.write.parquet("output_data.parquet")  # Save in a format like Parquet

9. Cleaning Up:

  • Stop SparkSession: Finish your Spark session and clean up the workspace, like tidying up the kitchen after cooking.
spark.stop()

Monitor your Apache Spark applications in Synapse Studio

With Azure Synapse Analytics, you can use Apache Spark to run notebooks, jobs, and other kinds of applications on your Apache Spark pools in your workspace.

Access Apache Spark applications list

To see the list of Apache Spark applications in your workspace, first open the Synapse Studio and select your workspace.

Once you’ve opened your workspace, select the Monitor section on the left.

Select Apache Spark applications to view the list of Apache Spark applications.

Filter your Apache Spark applications

You can filter the list of Apache Spark applications to the ones you’re interested in. The filters at the top of the screen allow you to specify a field on which you’d like to filter.

For example, you can filter the view to see only the Apache Spark applications that contain the name “sales”:

View Apache Spark applications

You can view all Apache Spark applications from Monitor -> Apache Spark applications.

View completed Apache Spark applications