Azure Databricks Cluster
A cluster is a group of machines called “nodes” that work together to process your data and queries efficiently. In other words, it is a set of computation resources and configurations on which you run notebooks and jobs.
Types of clusters
The two types of clusters in Azure Databricks and their characteristics:

| Cluster Type | Creation | Lifespan | Suitability | User Access | Cost |
|---|---|---|---|---|---|
| All-purpose cluster | Created manually | Persistent | Suitable for interactive workloads | Shared among many users | Expensive to run |
| Job cluster | Created by jobs | Terminated at the end of the job | Suitable for automated workloads | Isolated to the job | Cheaper to run |
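As a hedged sketch of the difference: an all-purpose cluster is created manually (in the UI or via the Clusters API), while a job cluster is declared inline in a job definition and lives only for that run. The workspace URL, token, notebook path, runtime, and VM size below are placeholder assumptions, not values from this article.

```python
import requests

# Placeholder workspace URL and personal access token (assumptions for this sketch).
HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-..."  # never hard-code a real token

# A job cluster is declared inline as "new_cluster": Databricks creates it when
# the job starts and terminates it when the job ends, so you pay only per run.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Workspace/etl/nightly"},  # placeholder path
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # an LTS runtime, for example
                "node_type_id": "Standard_DS3_v2",    # example Azure VM size
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```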
On the Compute page you can see the following types of compute:
- All-purpose compute
- Job compute
- Pools
- Policies
When creating compute under all-purpose compute, you first see the policy; choose the default, Unrestricted.
Next, select the cluster mode: single node or multi node.
If you select single node there is no worker type to configure, but if you select multi node there is.
If you select multi node:
What are worker types? You pick a worker (VM) type as needed and set the minimum and maximum number of workers.
You can also enable spot instances.
There are multiple driver types as well; choose one as needed.
With, say, min workers 2 and max workers 8, the cluster autoscales the workers within that range according to the workload.
You can also set when the cluster terminates (auto termination after a period of inactivity); the API sketch below shows these settings together.
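To make the UI walkthrough above concrete, here is a minimal sketch of the equivalent Clusters API (2.0) payload; the workspace URL, token, cluster name, and Azure VM sizes are placeholder assumptions.

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-..."                                            # placeholder

cluster_spec = {
    "cluster_name": "demo-all-purpose",
    "spark_version": "13.3.x-scala2.12",       # pick the latest LTS in practice
    "node_type_id": "Standard_DS3_v2",         # worker type, chosen per workload
    "driver_node_type_id": "Standard_DS3_v2",  # driver type, may differ from workers
    "autoscale": {"min_workers": 2, "max_workers": 8},  # autoscaling range
    "azure_attributes": {
        # Ask for Azure spot VMs, falling back to on-demand if spot is unavailable.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
    },
    "autotermination_minutes": 30,  # terminate after 30 minutes of inactivity
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])
```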
Under Advanced options you can see Spark, Logging, and Init Scripts tabs.
In the Spark tab you can define Spark config and environment variables.
In the Logging tab you can configure the log delivery destination.
In the Init Scripts tab you can register scripts that run when the cluster starts, for example to install the Python libraries that every notebook on the cluster should have by default (see the sketch after the link below).
See this link for more details about init scripts in Databricks: https://www.cloudopsnow.in/what-is-init-scripts-in-databricks/
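As a rough sketch, those three advanced-option tabs map onto fields of the same cluster spec used above. The paths and values here are illustrative assumptions, and the init script itself would be a small shell script (for example, one that pip-installs the default Python libraries).

```python
# Fields that would be merged into the cluster_spec dictionary above.
advanced_options = {
    # Spark tab: Spark config and environment variables.
    "spark_conf": {"spark.sql.shuffle.partitions": "64"},
    "spark_env_vars": {"ENVIRONMENT": "dev"},
    # Logging tab: where driver and worker logs are delivered.
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
    # Init Scripts tab: scripts that run on every node at startup.
    "init_scripts": [
        {"workspace": {"destination": "/Shared/init/install-libs.sh"}}  # placeholder path
    ],
}
```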
For single node:
For real-world use, prefer the latest LTS runtime version.
Node type: choose according to the work environment.
We can clone or delete a cluster, and grant permissions on it (see the sketch below).
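For permissions, here is a hedged sketch using the cluster Permissions API; the user name, cluster ID, workspace URL, and token are all placeholders.

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-..."                                            # placeholder
CLUSTER_ID = "0101-000000-abcd1234"                           # placeholder

# Grant a user the right to attach notebooks to (but not manage) the cluster.
resp = requests.patch(
    f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "access_control_list": [
            {"user_name": "analyst@example.com", "permission_level": "CAN_ATTACH_TO"}
        ]
    },
)
resp.raise_for_status()
```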
Cluster Configuration
1. Cluster mode
Multi Node
- A multi-node cluster has one driver node VM and one or more worker node VMs.
- Whenever a request comes in, the driver node distributes the tasks to the worker nodes.
- The tasks run in parallel across the worker nodes.
- Worker nodes can be created as needed; the driver node takes care of coordinating all of this.
Single Node
- It has only a single VM, which acts as both the driver node and the worker node.
- When a request is received, the VM accepts it as the driver and then executes the work as the worker (see the PySpark sketch below).
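A minimal PySpark sketch of the driver/worker split described above: the driver splits the data into partitions, and each partition becomes a task that a worker executes (on a single-node cluster, the same VM plays both roles).

```python
from pyspark.sql import SparkSession

# In a Databricks notebook `spark` already exists; getOrCreate() reuses it,
# and also lets this sketch run locally with a plain PySpark install.
spark = SparkSession.builder.getOrCreate()

# The driver splits the data into 8 partitions; each partition becomes a task.
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)

# Tasks run in parallel on the worker executors (multi node),
# or on the single VM acting as both driver and worker (single node).
total = rdd.map(lambda x: x * x).sum()
print(total)  # 328350
```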
2. Access Mode
The access configurations are detailed in three categories:
- Single User: Allows access for one user only. Supports Python, SQL, Scala, and R.
- Shared: Allows access for multiple users and is only available on the Premium plan. Supports Python, SQL, Scala, and R.
- No Isolation Shared: Allows access for multiple users without isolation between them, meaning users can potentially access each other’s work or data. Supports Python, SQL, Scala, and R.
3. Databricks Runtime
This section is divided into several runtime options, each tailored to a specific type of workload:
- Databricks Runtime:
- Includes Apache Spark and supports programming languages Scala, Python, SQL, and R.
- Runs on an Ubuntu environment with various libraries and supports GPU acceleration and Delta Lake for reliable data storage.
- Databricks Runtime ML:
- Inherits everything from the standard Databricks Runtime.
- Includes popular machine learning libraries like PyTorch, Keras, TensorFlow, and XGBoost.
- Databricks Runtime Genomics:
- Also based on the standard Databricks Runtime.
- Features popular open-source genomics libraries such as Glow and ADAM and supports genomic pipelines like DNAseq and RNAseq.
- Databricks Runtime Light:
- A lightweight runtime option designed for jobs that do not require advanced features, possibly focusing on lower resource consumption and cost efficiency.
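A small sketch for checking which runtime a notebook is attached to, run inside a Databricks notebook where `spark` is predefined; `DATABRICKS_RUNTIME_VERSION` is an environment variable Databricks sets on cluster nodes.

```python
import os

# DATABRICKS_RUNTIME_VERSION is set by Databricks on every cluster node.
print(os.environ.get("DATABRICKS_RUNTIME_VERSION"))  # e.g. "13.3"

# `spark` is predefined in Databricks notebooks; this prints the Apache Spark
# version bundled with the selected runtime.
print(spark.version)
```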
4. Auto Termination
- Terminates the cluster after a specified period of inactivity, so you do not pay for idle compute.
5. Auto Scaling
- The user specifies the minimum and maximum number of worker nodes.
- The cluster autoscales between the minimum and maximum based on the workload.
- Not recommended for streaming workloads (see the sketch below).
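As a final sketch, these are the two ways the worker count can appear in a cluster spec (field names from the Clusters API; the specific numbers are arbitrary):

```python
# Autoscaling: the cluster varies its worker count between the bounds.
autoscaling_size = {"autoscale": {"min_workers": 2, "max_workers": 8}}

# Fixed size: often preferred for streaming jobs, since classic autoscaling
# is not recommended for streaming workloads.
fixed_size = {"num_workers": 4}
```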