Azure Databricks Clusters and Types of Clusters


Azure Databricks Cluster

A cluster is a group of machines called “nodes” that work together to process your data and queries efficiently.

In Azure Databricks, a cluster is a set of computation resources and configurations on which you run notebooks and jobs.

Types of clusters

The two types of clusters in Azure Databricks and their characteristics:

| Cluster Type | Creation | Lifespan | Suitability | User Access | Cost |
| --- | --- | --- | --- | --- | --- |
| All Purpose Cluster | Created manually | Persistent | Suitable for interactive workloads | Shared among many users | Expensive to run |
| Job Cluster | Created by jobs | Terminated at the end of the job | Suitable for automated workloads | Isolated just for the job | Cheaper to run |
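As a sketch of how this difference shows up in practice, the fragments below are shaped like Databricks REST API payloads: an all-purpose cluster created directly through the Clusters API, and a job cluster declared inline in a job definition so it is created and terminated with the job. The node type, runtime version, and names are illustrative assumptions, not recommendations.

```python
# Hedged sketch: an all-purpose cluster is created on its own and persists.
all_purpose_cluster = {
    "cluster_name": "shared-interactive",    # persists until deleted
    "spark_version": "13.3.x-scala2.12",     # illustrative LTS runtime
    "node_type_id": "Standard_DS3_v2",       # illustrative Azure VM size
    "num_workers": 2,
    "autotermination_minutes": 60,           # still pays while it runs
}

# A job cluster is declared inside the job definition; Databricks creates it
# when the job starts and terminates it when the job ends.
job_with_job_cluster = {
    "name": "nightly-etl",
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {"notebook_path": "/Jobs/etl"},  # hypothetical path
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
}
```

Because the job cluster exists only for the duration of the run, it is the cheaper option for automated workloads.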

The compute types you can see are listed below:

  • All-purpose compute
  • Job compute
  • Pools
  • Policies

When creating compute under all-purpose compute, you can see the policy option; choose the default, Unrestricted.

Next, select the node configuration: single node or multi node. If you select single node, there is no worker type to configure; if you select multi node, there is.

If selecting multi node:

What are worker types?

The worker type is the VM size used for the worker nodes; select one as needed, and choose the minimum and maximum number of workers.

You can also enable spot instances to reduce cost.

There are multiple driver types as well; choose one as per your needs.

With autoscaling enabled, and for example a minimum of 2 and a maximum of 8 workers, the cluster automatically scales the number of workers between those bounds according to the workload.

You can also set when the cluster terminates (auto termination after a period of inactivity).
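Putting the last few settings together, here is a hedged sketch of the corresponding Clusters API fields (field names per the Databricks REST API; the worker counts follow the example above, everything else is illustrative):

```python
# Sketch: cluster payload with autoscaling and auto termination configured.
cluster_config = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "13.3.x-scala2.12",   # illustrative LTS runtime
    "node_type_id": "Standard_DS3_v2",     # illustrative Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,         # terminate after 30 idle minutes
}
```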

Inside the Advanced options we can see Spark, Logging, and Init scripts tabs:

  • Spark: define Spark configuration properties and environment variables.
  • Logging: configure the destination for cluster log delivery.
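As a sketch, these advanced options map onto fields of the Clusters API payload; the property names and values below are illustrative assumptions, not recommendations:

```python
# Hedged sketch: advanced-options fragment of a Clusters API payload.
advanced_options = {
    "spark_conf": {
        # Spark configuration properties applied at cluster start
        "spark.sql.shuffle.partitions": "64",
    },
    "spark_env_vars": {
        # Environment variables visible to processes on the cluster
        "ENVIRONMENT": "dev",
    },
    "cluster_log_conf": {
        # Deliver driver and worker logs to this DBFS path
        "dbfs": {"destination": "dbfs:/cluster-logs"}
    },
}
```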

  • Init scripts: scripts that run when the cluster starts, so that, for example, the Python libraries your notebooks rely on are installed automatically on every node.

Go through the link for more details about init scripts in Databricks.
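As an illustration, a minimal init script might pre-install the Python libraries every notebook on the cluster needs; the script path and the library pin below are hypothetical:

```python
# Write a hypothetical cluster init script that installs Python libraries
# on every node at cluster start-up.
init_script = """#!/bin/bash
# Runs on each node when the cluster starts.
pip install requests==2.31.0
"""

with open("install-libs.sh", "w") as f:
    f.write(init_script)

# The cluster then references the script in its configuration,
# e.g. via the init_scripts field of the Clusters API payload
# (destination path is a placeholder).
init_scripts_fragment = {
    "init_scripts": [
        {"workspace": {"destination": "/Shared/init/install-libs.sh"}}
    ]
}
```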

For single node:

For real-world use, please choose the latest LTS runtime version.

Node type: choose as per the work environment.

We can clone or delete a cluster, and grant permissions on it.

Cluster Configuration

1. Cluster mode

Multi Node

  • A multi-node cluster has a driver node VM and worker node VMs.
  • Whenever a request comes in, the driver node distributes the tasks to the worker nodes.
  • Tasks are executed in parallel across the worker nodes.
  • Worker nodes can be created as per the need; the driver node takes care of coordinating all of this.

Single Node

  • It has only a single VM, which acts as both the driver node and the worker node.
  • When a request is received, the VM accepts it as the driver node and then does the work of a worker node.
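A single-node cluster can be sketched in Clusters API terms as follows; the `spark.master` and tag settings mirror what the UI applies for single-node clusters, but treat the exact keys as assumptions to verify against the documentation:

```python
# Sketch: single-node cluster config — one VM, zero separate workers.
single_node_cluster = {
    "cluster_name": "single-node-demo",
    "spark_version": "13.3.x-scala2.12",   # illustrative LTS runtime
    "node_type_id": "Standard_DS3_v2",     # illustrative Azure VM size
    "num_workers": 0,                      # the driver VM does the work too
    "spark_conf": {
        "spark.master": "local[*]",        # run Spark locally on the driver
        "spark.databricks.cluster.profile": "singleNode",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```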

2. Access Mode

The access configurations are detailed in three categories:

  • Single User: Allows only one user access. Supports Python, SQL, Scala, and R programming languages.
  • Shared User: Allows multiple user access and is only available in the premium version of the platform. Supports Python, SQL, Scala, and R.
  • No Isolation Shared: Allows multiple user access without isolation among the users, meaning users can potentially access each other’s work or data. Supports Python, SQL, Scala, and R.
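In the Clusters API, the access mode corresponds to the `data_security_mode` field; the mapping below reflects the API enum values as commonly documented, so treat it as an assumption to verify:

```python
# Assumed mapping from UI access modes to Clusters API data_security_mode values.
ACCESS_MODES = {
    "Single User": "SINGLE_USER",
    "Shared": "USER_ISOLATION",
    "No Isolation Shared": "NONE",
}

single_user_fragment = {
    "data_security_mode": ACCESS_MODES["Single User"],
    # Required for single-user mode: the one user allowed on the cluster
    "single_user_name": "someone@example.com",  # placeholder identity
}
```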

3. Databricks Runtime

This section is elaborately divided into several runtime options, each tailored for specific types of workloads:

  • Databricks Runtime:
    • Includes Apache Spark and supports programming languages Scala, Python, SQL, and R.
    • Runs on an Ubuntu environment with various libraries and supports GPU acceleration and Delta Lake for reliable data storage.
  • Databricks Runtime ML:
    • Inherits everything from the standard Databricks Runtime.
    • Includes popular machine learning libraries like Pytorch, Keras, TensorFlow, and XGBoost.
  • Databricks Runtime Genomics:
    • Also based on the standard Databricks Runtime.
    • Features popular open-source genomics libraries such as Glow and Adam and supports genomic pipelines like DNAseq and RNAseq.
  • Databricks Runtime Light:
    • A lightweight runtime option designed for jobs that do not require advanced features, possibly focusing on lower resource consumption and cost efficiency.
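When configuring a cluster through the API, the runtime choice becomes a `spark_version` string, and the variants differ by suffix. The exact strings vary by release, so the values below are illustrative only:

```python
# Illustrative spark_version strings for the main runtime families.
runtime_versions = {
    "standard": "13.3.x-scala2.12",        # Databricks Runtime
    "ml": "13.3.x-cpu-ml-scala2.12",       # Databricks Runtime ML (CPU)
    "ml_gpu": "13.3.x-gpu-ml-scala2.12",   # Databricks Runtime ML with GPU
}
```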

4. Auto Termination

  • Terminates the cluster automatically after a configured period of inactivity, which avoids paying for idle compute.

5. Auto Scaling

  • User specifies the min and max worker nodes
  • Auto-scales between min and max based on the workload
  • Not recommended for streaming workloads

6. Cluster VM Type/Size

  • The VM type and size used for the driver and worker nodes, chosen to match the workload.
