Init scripts in Databricks are shell scripts that run while a cluster is starting up, or when a new node is added to an existing cluster, before any notebooks, jobs, or data engineering tasks execute. They are typically used to prepare the cluster environment: installing packages, downloading data, setting environment variables, or applying configuration changes that are not natively supported through the Databricks UI.
Simple Example of an Init Script
Let’s consider a scenario where you need to install a specific Python package on every node of a Databricks cluster. You would use an init script to ensure this package is installed before any jobs are run.
Create the Init Script:
- You would write a bash script that includes the command to install the Python package.
```shell
#!/bin/bash
/databricks/python/bin/pip install your-package-name
```
Upload the Script to a DBFS (Databricks File System) Location:
- You could upload the script to a location like `dbfs:/databricks/scripts/install_python_package.sh`.
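In practice the upload is usually done with the Databricks CLI (`databricks fs cp`). The sketch below writes the script from the previous step and simulates the copy into a local stand-in directory for DBFS so it runs anywhere; in a real workspace the final `cp` would instead be `databricks fs cp /tmp/install_python_package.sh dbfs:/databricks/scripts/`.

```shell
# Local stand-in for the DBFS target directory (assumption: in a real
# workspace you would use `databricks fs cp`, not plain cp).
mkdir -p /tmp/dbfs/databricks/scripts

# The init script from the previous step.
cat > /tmp/install_python_package.sh <<'EOF'
#!/bin/bash
/databricks/python/bin/pip install your-package-name
EOF

# Simulates: databricks fs cp /tmp/install_python_package.sh dbfs:/databricks/scripts/
cp /tmp/install_python_package.sh /tmp/dbfs/databricks/scripts/install_python_package.sh
```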
Configure the Cluster to Use the Init Script:
- When creating or editing a cluster, you specify the init script path under the “Advanced Options” section in the cluster configuration.
- Enter the DBFS path to the script, such as `dbfs:/databricks/scripts/install_python_package.sh`.
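The same setting can also be supplied outside the UI: when a cluster is created through the Clusters REST API or infrastructure-as-code tooling, the init script is listed in the cluster's JSON definition. A minimal fragment might look like this (`example-cluster` is a placeholder name, and the DBFS-backed form shown here matches the path used above):

```json
{
  "cluster_name": "example-cluster",
  "init_scripts": [
    {
      "dbfs": { "destination": "dbfs:/databricks/scripts/install_python_package.sh" }
    }
  ]
}
```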
Start the Cluster:
- Upon startup, Databricks runs the init script, installing the specified package on each node of the cluster.
- Any notebook or job that runs on this cluster will have access to the installed package.
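Once the cluster is up, it is worth confirming from a notebook `%sh` cell (or a job) that the package actually imports. A small check of that kind, using the standard-library module `json` as a stand-in for your installed package name:

```shell
# Stand-in package name; replace with the package your init script installs.
PKG="json"
if python3 -c "import ${PKG}" 2>/dev/null; then
  echo "${PKG} is installed"
else
  echo "${PKG} is missing"
fi
```

Since `json` ships with Python, this prints `json is installed`; with a real package name, a `missing` result means the init script did not run or failed.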
Benefits of Using Init Scripts
- Consistency: Ensures all cluster nodes are configured identically, reducing inconsistencies in runtime environments.
- Automation: Automates the setup process for complex installations or configurations, saving time and reducing manual errors.
- Flexibility: Allows for custom configurations that are not directly supported through the Databricks UI.