Introduction
Databricks allows users to install external libraries (JARs, Python wheels, PyPI packages) to extend functionality in notebooks and jobs. However, job failures due to library issues are common and can be caused by dependency conflicts, network connectivity issues, incorrect library versions, or missing permissions.
In this guide, we’ll explore the common causes of external library-related job failures, how to diagnose these issues, and best practices to ensure smooth execution.
How External Libraries Work in Databricks
Databricks supports various types of external libraries:
- PyPI packages (e.g., `pandas`, `numpy`, `scikit-learn`)
- Maven or JAR dependencies (e.g., `spark-avro`, `delta-core_2.12`)
- Custom Python wheels (`.whl` files)
- Custom JAR files uploaded to DBFS or cloud storage
💡 Libraries can be installed via:
- Databricks UI (Cluster Libraries)
- Notebook-scoped `%pip install` commands
- Databricks Libraries API (`POST /api/2.0/libraries/install`)
🚨 Common issues include version mismatches, missing dependencies, and incompatible environments.
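For the API route, a minimal sketch of the request body for the Libraries API follows. The workspace URL, cluster ID, and versions below are placeholders; the actual call also requires a personal access token.

```python
import json

# Placeholder values -- substitute your workspace URL and cluster ID.
WORKSPACE_URL = "https://example.cloud.databricks.com"
CLUSTER_ID = "0123-456789-abcde000"

def build_install_payload(cluster_id, pypi_packages=(), maven_coords=()):
    """Build the JSON body for POST /api/2.0/libraries/install."""
    libraries = [{"pypi": {"package": p}} for p in pypi_packages]
    libraries += [{"maven": {"coordinates": c}} for c in maven_coords]
    return {"cluster_id": cluster_id, "libraries": libraries}

payload = build_install_payload(
    CLUSTER_ID,
    pypi_packages=["pandas==1.3.3"],
    maven_coords=["io.delta:delta-core_2.12:1.2.0"],
)
print(json.dumps(payload, indent=2))

# The actual call (requires a valid token):
# requests.post(f"{WORKSPACE_URL}/api/2.0/libraries/install",
#               headers={"Authorization": f"Bearer {TOKEN}"}, json=payload)
```

Libraries installed this way persist on the cluster, unlike notebook-scoped `%pip install`.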
Common Causes of External Library Job Failures
1. Version Conflicts Between Installed Libraries
Symptoms:
- Error: “ModuleNotFoundError: No module named ‘package_name’”
- Error: “ImportError: cannot import name ‘XYZ’ from ‘package_name’”
- Unexpected behavior due to different versions of installed libraries
Causes:
- Conflicting package versions installed at cluster-level vs. notebook-level
- Incompatible versions between Databricks Runtime and library requirements
- Automatic package resolution installing unexpected versions
Fix:
✅ Use notebook-scoped libraries (`%pip install` instead of UI installation):

```
%pip install pandas==1.3.3
```
✅ Manually check for dependency conflicts using `pip check`:

```
%sh pip check
```
✅ Use a conda `environment.yaml` for consistent dependencies across environments:

```yaml
name: myenv
dependencies:
  - python=3.9
  - pandas=1.3.3
  - numpy=1.21
```
2. Network Connectivity Issues Preventing Library Installation
Symptoms:
- Error: “Connection timed out while installing library”
- Error: “Could not find a version that satisfies the requirement”
- Jobs fail intermittently due to package retrieval failures
Causes:
- No internet access on Databricks clusters (firewalled environment)
- Private PyPI or Maven repositories not accessible
- Cloud VPC/VNet blocking outbound traffic to package repositories
Fix:
✅ Enable cloud networking to allow package downloads (AWS VPC, Azure Private Link)
✅ Use a private PyPI repository instead of internet-based sources:

```
%pip config set global.index-url https://pypi.yourcompany.com/simple/
```
✅ Preinstall libraries in a custom container image (Databricks Container Services) or via cluster init scripts
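One way to make the private index the default for every `pip` invocation on the cluster is a cluster-scoped init script that writes `/etc/pip.conf`. This is a sketch; the index URL is a placeholder for your company's repository.

```shell
#!/bin/bash
# Cluster init script sketch: point every pip invocation at an internal
# PyPI mirror so %pip install works without public internet access.
cat > /etc/pip.conf <<'EOF'
[global]
index-url = https://pypi.yourcompany.com/simple/
EOF
```

Attach the script in the cluster's Advanced Options → Init Scripts so it runs before any library installation.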
3. Missing Required Libraries or Incorrect Import Paths
Symptoms:
- Error: “ModuleNotFoundError: No module named ‘xyz’”
- Libraries work in interactive notebooks but fail in scheduled jobs
Causes:
- Library is installed only in notebook scope (`%pip install`) and not available in the job
- Incorrect Python environment paths in scheduled jobs
- Missing dependencies that are not automatically installed
Fix:
✅ Ensure libraries are installed at the correct scope:
- Use Cluster Libraries for persistent installation
- Use `%pip install` inside the job script for runtime installs
📌 Example: Installing in a Job Notebook Before Execution

```
%pip install requests
import requests
```
✅ For JAR-based dependencies, install the Maven coordinates as a cluster library (via the UI or Libraries API); once attached, the classes can be imported in Scala:

```
%scala
import org.apache.spark.sql.functions._
```
4. Job Failures Due to Incompatible Java JARs or Missing Dependencies
Symptoms:
- Error: “ClassNotFoundException: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe”
- Error: “java.lang.NoClassDefFoundError: org.apache.spark.sql.delta.DeltaLog”
- Spark jobs fail when interacting with Delta Lake, Hadoop, or JDBC drivers
Causes:
- Incorrect JAR versions conflicting with Spark runtime
- Missing JAR files in classpath
- Conflicting versions of `delta-core` or `hadoop-common` JARs
Fix:
✅ Install the correct JAR versions as a cluster library using Maven coordinates (Cluster → Libraries → Install New → Maven), e.g.:

```
io.delta:delta-core_2.12:1.2.0
```
✅ Ensure JARs are installed at the cluster level for shared execution
✅ For custom JARs, upload them to DBFS and copy them onto the driver's classpath. Note that copying at runtime only affects JVMs started afterwards; for Spark itself, do the copy in a cluster init script:

```
dbutils.fs.cp("dbfs:/FileStore/libs/mycustom.jar", "file:/databricks/jars/")
```
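An init-script version of the same copy runs before the Spark JVM starts, so the JAR is on the classpath from the first job. This is a sketch; the source path is illustrative (on cluster nodes, `dbfs:/` is mounted at `/dbfs`).

```shell
#!/bin/bash
# Cluster init script sketch: copy a custom JAR onto the Spark classpath
# before the JVM starts. The source path is illustrative.
SRC="/dbfs/FileStore/libs/mycustom.jar"
DEST_DIR="/databricks/jars"
mkdir -p "$DEST_DIR"
if [ -f "$SRC" ]; then
  cp "$SRC" "$DEST_DIR/"
else
  echo "WARNING: $SRC not found; JAR not installed" >&2
fi
```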
5. PyPI Package Installations Failing Due to Missing System Dependencies
Symptoms:
- Error: “OSError: libGL.so.1: cannot open shared object file”
- Error: “Failed building wheel for XYZ”
- Job fails but works fine in local Python execution
Causes:
- Some Python packages require underlying OS dependencies (C++, OpenCV, TensorFlow)
- OS-level packages are not installed by pip and must be added separately (e.g., via init scripts)
Fix:
✅ Use the Databricks ML Runtime if using deep learning libraries
✅ For complex native dependencies, install via conda instead of pip (on conda-based runtimes):

```
%sh
conda install -c conda-forge opencv
```
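Telling a missing Python package apart from a missing OS library is half the diagnosis. A small helper (hypothetical, names are my own) can classify the failure by exception type, since a missing shared object surfaces as `OSError` rather than `ImportError`:

```python
import importlib

def diagnose_import(module_name):
    """Classify why importing a module fails: missing Python package,
    missing OS shared library, or some other import problem."""
    try:
        importlib.import_module(module_name)
        return "ok"
    except ImportError as e:
        # "No module named ..." -> the Python package itself is missing
        if "No module named" in str(e):
            return "missing python package"
        return f"import error: {e}"
    except OSError as e:
        # e.g. "libGL.so.1: cannot open shared object file" -> OS dependency
        return f"missing system library: {e}"

print(diagnose_import("json"))         # stdlib, reports "ok"
print(diagnose_import("no_such_pkg"))  # reports "missing python package"
```

A "missing system library" result points at init scripts or the ML Runtime rather than `%pip install`.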
Step-by-Step Troubleshooting Guide
1. Check Installed Libraries

```
%pip list
```

2. Verify If Any Dependency Conflicts Exist

```
%sh pip check
```

3. Check Cluster Logs for Library Installation Errors
- Go to Databricks UI → Clusters → Libraries → Event Log

4. Debug Library Search Paths

```
import sys
print(sys.path)
```

5. Test Library Installation in an Interactive Notebook

```
import pandas as pd
print(pd.__version__)
```
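Steps 1 and 5 can be scripted so a job checks its own environment before doing real work. This is a sketch using the standard library's `importlib.metadata`; the function name is my own.

```python
from importlib import metadata

def check_versions(packages):
    """Return {package: version} for installed packages, None if missing."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report

print(check_versions(["pip", "definitely-not-installed"]))
```

Logging this report at the start of a job makes version-mismatch failures much easier to trace afterwards.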
Best Practices to Prevent Library-Related Job Failures
✅ Use Cluster Libraries Instead of Notebook `%pip install` for Jobs
- Notebook-scoped `%pip install` does not persist across job runs.
- Install at the cluster level for jobs to ensure consistent availability.
✅ Use requirements.txt or Conda for Dependency Management

```yaml
name: myenv
dependencies:
  - python=3.9
  - pandas=1.3.3
  - numpy=1.21
```
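An equivalent `requirements.txt` pinning the same versions (versions here are illustrative) would be:

```
pandas==1.3.3
numpy==1.21.0
```

It can be applied in a notebook or job with `%pip install -r` pointing at the file's DBFS path.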
✅ Use Databricks ML Runtime for Machine Learning Dependencies
- Avoid installing large ML libraries manually (`tensorflow`, `pytorch`).
- Use Databricks ML Runtimes that come pre-installed with ML packages.
✅ Monitor Job and Library Logs
- Set up Databricks Alerts for failed installations.
- Monitor DBFS logs for missing dependency issues.
Real-World Example: Fixing a Job Failure Due to Pandas Version Conflict
Scenario:
A Databricks job running an ETL pipeline failed with a pandas version mismatch.
Root Cause:
- The Databricks runtime used pandas 1.2, but the job required pandas 1.3.
- A mix of cluster-scoped and notebook-scoped installations caused conflicts.
Solution:
- Uninstalled existing versions and reinstalled the correct one:

```
%pip uninstall pandas -y
%pip install pandas==1.3.3
```
- Updated job environment to use the correct version.
✅ Impact:
- The ETL job ran successfully with consistent library versions.
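A defensive guard at the top of the job would have turned this mismatch into an immediate, readable failure instead of a mid-pipeline error. A sketch (function name and version are illustrative):

```python
from importlib import metadata

REQUIRED_PANDAS = "1.3.3"  # version the job was built and tested against

def assert_version(package, required):
    """Fail fast with a clear message if the installed version differs."""
    installed = metadata.version(package)
    if installed != required:
        raise RuntimeError(
            f"{package} version mismatch: job requires {required}, "
            f"cluster has {installed}"
        )

# At job start:
# assert_version("pandas", REQUIRED_PANDAS)
```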
Conclusion
Library-related job failures in Databricks often stem from dependency conflicts, network issues, missing system packages, or incompatible JARs. By ensuring proper package management, leveraging Databricks ML Runtimes, and using cluster-level installations, teams can prevent failures and maintain stable job execution.