In this tutorial we will learn in depth about the DataFrame API in Databricks.

What is the schema of a DataFrame?

In Databricks, a DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database. Each column has a specific data type, such as string, integer, or double. The schema of a DataFrame defines the names and data types of its columns.

Here’s a breakdown of what schema is in Databricks, with step-by-step explanations and simple examples:

1. What is a schema?

  • Think of a schema as the blueprint of a DataFrame. It tells you what information each column holds and how it’s structured.
  • Just like a table of contents in a book, the schema helps you understand the organization of the data within the DataFrame.

2. Components of a schema:

  • Column names: Each column in the DataFrame has a unique name that identifies it.
  • Data types: Each column has a data type that specifies the type of data it holds, such as string, integer, double, boolean, or complex data types like arrays or structures.
  • Constraints (optional): Some schemas may contain additional constraints on the data, such as specifying minimum or maximum values or enforcing specific format rules.

3. Understanding schema with an example:

Imagine a DataFrame containing information about customers. It might have columns like:

  • name: string
  • age: integer
  • city: string
  • purchases: array of strings

This schema tells us that the DataFrame has four columns:

  • name holds text data and represents the customer’s name.
  • age holds integer values and represents the customer’s age.
  • city holds text data and represents the customer’s city.
  • purchases holds an array of strings representing the customer’s purchase history.

4. Importance of schema:

  • The schema plays a crucial role in ensuring data consistency and validity.
  • It allows Databricks to optimize data processing and analysis tasks by understanding the structure of the data.
  • Knowing the schema helps you understand how to access and manipulate data within the DataFrame.

5. Accessing and modifying schema:

  • Databricks provides various ways to access and modify the schema of a DataFrame.
  • You can use built-in functions to get the schema information, such as df.printSchema().
  • You can also manipulate the schema by adding, renaming, or dropping columns.

Databricks Spark Data Types in Detail

In Databricks Spark, data types define the kind of data that can be stored in the columns of a DataFrame. Here’s an in-depth look at some commonly used data types and examples of how they are used:

1. Numeric Types:

  • Integer Type (IntegerType): Represents whole numbers.

from pyspark.sql.types import IntegerType

# Example: Creating a DataFrame column with IntegerType
df = spark.createDataFrame([(1,), (2,), (3,)], ["Number"])

  • Decimal Type (DecimalType): Represents fixed-point numbers.

from decimal import Decimal
from pyspark.sql.types import DecimalType

# Example: Creating a DataFrame column with DecimalType
# (Python Decimal values infer a DecimalType column; plain floats would infer DoubleType)
df = spark.createDataFrame([(Decimal("2.5"),), (Decimal("3.7"),), (Decimal("4.1"),)], ["Decimal"])

2. String Types:

  • String Type (StringType): Represents text data.

from pyspark.sql.types import StringType

# Example: Creating a DataFrame column with StringType
df = spark.createDataFrame([("Alice",), ("Bob",), ("Charlie",)], ["Name"])

  • Character Type (CharType), Varying Character Type (VarcharType): Specific lengths for strings.

from pyspark.sql.types import CharType, VarcharType

# Example: a column of fixed-length strings. CharType(n) and VarcharType(n)
# appear mainly in table DDL; within a DataFrame the values are handled as
# ordinary strings.
df = spark.createDataFrame([("ABC",), ("DEF",), ("GHI",)], ["CharColumn"])

3. Boolean Type:

  • Boolean Type (BooleanType): Represents true/false values.

from pyspark.sql.types import BooleanType

# Example: Creating a DataFrame column with BooleanType
df = spark.createDataFrame([(True,), (False,), (True,)], ["Is_Single"])

4. Date and Timestamp Types:

  • Date Type (DateType): Represents calendar dates.

from datetime import date
from pyspark.sql.types import DateType

# Example: Creating a DataFrame column with DateType
# (datetime.date values infer a DateType column; the original strings would infer StringType)
df = spark.createDataFrame([(date(2023, 12, 1),), (date(2023, 12, 2),), (date(2023, 12, 3),)], ["Date"])

  • Timestamp Type (TimestampType): Represents date and time information.

from datetime import datetime
from pyspark.sql.types import TimestampType

# Example: Creating a DataFrame column with TimestampType
# (datetime.datetime values infer a TimestampType column)
df = spark.createDataFrame(
    [(datetime(2023, 12, 1, 12, 30, 45),),
     (datetime(2023, 12, 2, 13, 40, 22),),
     (datetime(2023, 12, 3, 14, 15, 30),)],
    ["Timestamp"],
)

5. Complex Types:

  • Array Type (ArrayType): Represents arrays or lists of elements.

from pyspark.sql.types import ArrayType, StringType

# Example: Creating a DataFrame column with ArrayType
df = spark.createDataFrame(
    [(["apple", "banana"],), (["orange", "grapes"],), (["melon", "strawberry"],)],
    ["Fruits"],
)

  • Map Type (MapType): Represents key-value pairs.

from pyspark.sql.types import MapType, StringType, IntegerType

# Example: Creating a DataFrame column with MapType
# (the column name "Mapping" is illustrative)
df = spark.createDataFrame([({"a": 1, "b": 2},), ({"x": 10, "y": 20},)], ["Mapping"])

Creating DataFrames with a schema in Spark in different ways

Spark provides several methods to create a DataFrame with a pre-defined schema. Here are some common approaches, explained step-by-step with examples:

1. Using createDataFrame(data, schema):

  • This method takes two arguments:
    • data: The actual data, as a list or RDD of rows (in PySpark a pandas DataFrame also works; in Scala, a Seq).
    • schema: This is a StructType object representing the desired schema of the DataFrame.
  • Example:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Define data
data = [
    Row(name="Alice", age=25, city="Seattle"),
    Row(name="Bob", age=30, city="New York"),
    Row(name="Charlie", age=28, city="Los Angeles"),
]

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])

# Create DataFrame with schema
df = spark.createDataFrame(data, schema)

# Show DataFrame
df.show()

2. Using toDF():

  • This method converts an existing data structure, such as an RDD, into a DataFrame with the provided column names.
  • Example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Define data
data = [("Alice", 25, "Seattle"), ("Bob", 30, "New York"), ("Charlie", 28, "Los Angeles")]

# Define schema
schema = ["name", "age", "city"]

# Create DataFrame with schema
df = spark.sparkContext.parallelize(data).toDF(schema)

# Show DataFrame
df.show()

3. Using StructType with implicit conversions (Scala only):

  • This approach leverages Scala’s implicit conversions to create a DataFrame with a schema inferred from case class or tuple data.
  • Example:
case class Person(name: String, age: Int, city: String)

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

val data = Seq(Person("Alice", 25, "Seattle"), Person("Bob", 30, "New York"), Person("Charlie", 28, "Los Angeles"))

val df = data.toDF()

df.show()

4. Reading data with a predefined schema:

  • When reading data from external sources like CSV files, Spark allows you to specify the schema directly during the read operation.
  • Example:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Define schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])

# Read data with schema
df = spark.read.option("header", True).schema(schema).csv("data.csv")

# Show DataFrame
df.show()

How to create a DataFrame from a CSV file in Scala?

Simple DataFrame creation from a CSV file

val df = spark.read.csv("file_path.csv")

Create a DataFrame with automatic schema inference

val df = spark.read.format("csv").option("inferSchema", "true").load("file_path.csv")

Create a DataFrame with a header row

val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("file_path.csv")

How to create a DataFrame from a CSV file in PySpark?

Simple DataFrame creation from a CSV file

df = spark.read.csv("file_path.csv")

Create a DataFrame with automatic schema inference

df = spark.read.format("csv").option("inferSchema", "true").load("file_path.csv")

Create a DataFrame with a header row

df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("file_path.csv")

How to create a DataFrame from a JSON file in Scala?

Simple DataFrame creation from a JSON file

val df = spark.read.json("file_path.json")

Create a DataFrame using the generic reader (Spark infers the JSON schema automatically, so the CSV-style inferSchema option is not needed)

val df = spark.read.format("json").load("file_path.json")

Create a DataFrame from a multi-line JSON file (JSON files have no header row, so the header option does not apply; multiLine is the relevant reader option instead)

val df = spark.read.format("json").option("multiLine", "true").load("file_path.json")

How to create a DataFrame from a JSON file in PySpark?

Simple DataFrame creation from a JSON file

df = spark.read.json("file_path.json")

Create a DataFrame using the generic reader (Spark infers the JSON schema automatically, so the CSV-style inferSchema option is not needed)

df = spark.read.format("json").load("file_path.json")

Create a DataFrame from a multi-line JSON file (JSON files have no header row, so the header option does not apply; multiLine is the relevant reader option instead)

df = spark.read.format("json").option("multiLine", "true").load("file_path.json")

What are the different ways to define schema in Spark?

A schema can be defined in two different ways in Spark:

  1. Programmatic way
  2. Declarative way

Above we have seen the programmatic way. In the declarative way we use a Data Definition Language (DDL) string, which is simpler and easier to read.

Defining a DataFrame using a DDL schema, Scala example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = "name STRING, address STRING, salary INT"
val data = Seq(Row("Mike", "Wellington", 2000), Row("Sam", "New York", 232432))

val empDf = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType.fromDDL(schema)
)
empDf.printSchema()

Defining a DataFrame using a DDL schema, PySpark example:

from pyspark.sql.types import *

data = [["Mike","Wellington",2000],["Sam","New York",232432]]
schema1 = "name STRING, address STRING, salary INT"
emp_df = spark.createDataFrame(data, schema1)
emp_df.printSchema()