Introduction to Azure Databricks


  • Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform.
  • Azure Databricks provides one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
  • Databricks can interact with Azure Blob Storage, Azure Data Lake Store, Azure SQL Data Warehouse, Apache Kafka, Hadoop storage, and more.
  • Logic implemented in Databricks can be used in various applications such as machine learning, streaming, Power BI, and data warehousing.

There may be various applications such as business apps and custom apps, each containing different types of data. This data, often in an unstructured format (such as CSV files or other file formats), originates from multiple applications. The process involves taking this data, processing it, and converting it into meaningful information.

This processed and meaningful data is valuable for understanding the business and aiding in its growth.

The process of transforming data is known as data transformation, a key function where tools like Apache Spark, which processes big data, are utilized.

Azure has implemented Apache Spark under the name Azure Databricks.

In the diagram below, we can see a Data Factory pipeline that retrieves data from various sources such as blob storage, data lakes, or Kafka. The ADF pipeline then transforms this data into meaningful information and stores it in destinations like Cosmos DB, SQL databases, or data warehouses.

Whenever there is a need to transform big data into meaningful information, Databricks is essential.

The Apache Spark ecosystem includes the following features:

  • Spark SQL + DataFrames
  • Streaming
  • Machine Learning
  • Graph computation

To perform transformations, we can use any of the following languages:

  • R
  • SQL
  • Python
  • Scala
  • Java

Apache Spark

Apache Spark can process structured or unstructured data very quickly. Azure Databricks is built on Apache Spark concepts; Apache Spark is the backend for Databricks.

  • Apache Spark: Described as a lightning-fast unified analytics engine for big data processing and machine learning.
  • 100% Open source under Apache License: Indicates that Spark is fully open source, allowing it to be freely used, modified, and distributed under the terms of the Apache License.
  • Simple & Easy to use API: Emphasizes that Spark is designed to be user-friendly, with APIs that are simple to understand and implement, making it accessible to a wide range of developers.
  • In-memory processing engine: This feature enhances the speed and performance of data processing tasks by storing data in RAM instead of slower disk-based storage.
  • Distributed computing platform: Spark can process large datasets across many networked computers using distributed computing techniques, running jobs in parallel.
  • Unified engine which supports SQL, streaming, ML, and graphing processing: Spark provides a comprehensive suite that includes support for SQL queries, real-time data streaming, machine learning algorithms, and graph processing.
  • Integrates closely with other Big Data tools: Spark works seamlessly with other tools in the big data ecosystem, making it a versatile choice for complex data processing pipelines.

Azure Databricks

At the bottom – YARN, Mesos, and the Standalone Scheduler – these cluster managers handle resource management for the cluster.

In the middle – Spark Core – contains the basic functionality of Spark, such as task scheduling, memory management, fault tolerance, and interaction with storage systems.

At the top – Spark SQL – a package for working with structured data; it also supports different source formats such as Parquet files, Hive tables, and JSON files.

Spark Streaming – enables processing of live data streams: data is received and processed continuously.

MLlib – provides multiple machine learning algorithms.

GraphX – provides graph computation capabilities.

At the bottom we have the cluster managers; in the middle, Spark Core performs task scheduling and obtains resources from the cluster managers, serving the top layer, which provides the features described above.

Choosing Azure Databricks over Azure Data Factory
