
Job Failures with External Libraries in Databricks: Causes and Solutions

Introduction

Databricks allows users to install external libraries (JARs, Python wheels, PyPI packages) to extend functionality in notebooks and jobs. However, job failures due to library issues are common and can be caused by dependency conflicts, network connectivity issues, incorrect library versions, or missing permissions.

In this guide, we’ll explore the common causes of external library-related job failures, how to diagnose these issues, and best practices to ensure smooth execution.


How External Libraries Work in Databricks

Databricks supports various types of external libraries:

  • PyPI Packages (e.g., pandas, numpy, scikit-learn)
  • Maven or JAR Dependencies (e.g., spark-avro, delta-core_2.12)
  • Custom Python Wheels (.whl files)
  • Custom JAR Files uploaded to DBFS or cloud storage

💡 Libraries can be installed via:

  • Databricks UI (Cluster Libraries)
  • Notebook-scoped %pip install commands
  • Databricks Libraries API (POST /api/2.0/libraries/install)
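As an illustration of the API route, here is a minimal Python sketch that builds the JSON body for the Libraries API's /api/2.0/libraries/install endpoint; the host, token, and cluster ID are placeholders you would replace with your own workspace values:

```python
import json
from urllib import request

def build_install_payload(cluster_id, pypi_packages):
    """Build the JSON body for the Databricks Libraries API
    /api/2.0/libraries/install endpoint."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"pypi": {"package": pkg}} for pkg in pypi_packages],
    }

def install_libraries(host, token, cluster_id, pypi_packages):
    """POST the install request; host, token, and cluster_id are
    placeholder values, not real credentials."""
    body = json.dumps(build_install_payload(cluster_id, pypi_packages)).encode()
    req = request.Request(
        f"{host}/api/2.0/libraries/install",
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    return request.urlopen(req)

# Example payload for two PyPI packages:
payload = build_install_payload("0123-456789-abcde", ["pandas==1.3.3", "requests"])
```

The same payload shape works for Maven libraries by swapping the `pypi` key for `maven` with a `coordinates` field.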

🚨 Common issues include version mismatches, missing dependencies, and incompatible environments.


Common Causes of External Library Job Failures

1. Version Conflicts Between Installed Libraries

Symptoms:

  • Error: “ModuleNotFoundError: No module named ‘package_name’”
  • Error: “ImportError: cannot import name ‘XYZ’ from ‘package_name’”
  • Unexpected behavior due to different versions of installed libraries

Causes:

  • Conflicting package versions installed at cluster-level vs. notebook-level
  • Incompatible versions between Databricks Runtime and library requirements
  • Automatic package resolution installing unexpected versions

Fix:
Use Notebook-Scoped Libraries (%pip install instead of UI installation)

%pip install pandas==1.3.3

Manually resolve dependency conflicts using pip check:

!pip check

Use a conda environment.yml file for consistent dependencies across environments

name: myenv
dependencies:
  - python=3.9
  - pandas=1.3.3
  - numpy=1.21
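To catch drift between pinned and installed versions early, a small check can run at the top of a job using the standard library's importlib.metadata (Python 3.8+); check_pins is a hypothetical helper, not a Databricks API:

```python
from importlib import metadata

def check_pins(pins):
    """Compare pinned versions with what is actually installed.
    Returns (mismatched, missing); `pins` maps package name -> version."""
    mismatched, missing = [], []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
            continue
        if installed != wanted:
            mismatched.append((name, wanted, installed))
    return mismatched, missing

# A package that is not installed is reported as missing:
_, missing = check_pins({"no-such-package-xyz": "1.0"})
```

Failing fast on a mismatch at job start produces a much clearer error than a downstream ImportError deep in the pipeline.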

2. Network Connectivity Issues Preventing Library Installation

Symptoms:

  • Error: “Connection timed out while installing library”
  • Error: “Could not find a version that satisfies the requirement”
  • Jobs fail intermittently due to package retrieval failures

Causes:

  • No internet access on Databricks clusters (firewalled environment)
  • Private PyPI or Maven repositories not accessible
  • Cloud VPC/VNet blocking outbound traffic to package repositories

Fix:
Enable cloud networking to allow package downloads (AWS VPC, Azure Private Link)
Use a private PyPI repository instead of internet-based sources

%pip config set global.index-url https://pypi.yourcompany.com/simple/

Preinstall libraries in a custom container image using Databricks Container Services
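Before blaming pip itself, a quick probe from a notebook cell can confirm whether the cluster can reach the repository host at all; repo_reachable is a hypothetical helper using only the standard library:

```python
import socket

def repo_reachable(host, port=443, timeout=3):
    """Return True if a TCP connection to the package repository
    succeeds within `timeout` seconds; False on DNS or connect errors."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. repo_reachable("pypi.org") run on the cluster shows whether
# outbound traffic to public PyPI is allowed at all.
unreachable = repo_reachable("host.invalid")
```

A False result points at VPC/VNet egress rules or DNS rather than at the package itself.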


3. Missing Required Libraries or Incorrect Import Paths

Symptoms:

  • Error: “ModuleNotFoundError: No module named ‘xyz’”
  • Libraries work in interactive notebooks but fail in scheduled jobs

Causes:

  • Library is installed only in notebook scope (%pip install) and not available in the job
  • Incorrect Python environment paths in scheduled jobs
  • Missing dependencies not automatically installed

Fix:
Ensure libraries are installed at the correct scope:

  • Use Cluster Libraries for persistent installation
  • Use %pip install inside the job script for runtime installs

📌 Example: Installing in a Job Notebook Before Execution

%pip install requests
import requests
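For jobs that may run on clusters with or without the library pre-attached, an import guard is a common defensive pattern; ensure_package below is a hypothetical helper that installs only when the module is genuinely absent:

```python
import importlib.util
import subprocess
import sys

def ensure_package(name, pip_spec=None):
    """Import guard for scheduled jobs: install `name` at runtime only
    if it is not already importable on the cluster."""
    if importlib.util.find_spec(name) is not None:
        return True  # already available (cluster library or earlier run)
    # Fall back to a runtime pip install, optionally with a pinned spec
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_spec or name]
    )
    return importlib.util.find_spec(name) is not None

# Modules that are already present trigger no install at all:
ok = ensure_package("math")
```

Pinning via the second argument (e.g. `ensure_package("requests", "requests==2.28.1")`) keeps runtime installs deterministic.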

For JAR-based dependencies, attach the Maven coordinates as a cluster library (Compute → Libraries → Install new → Maven) rather than installing from the notebook; the classes are then importable in your code:

%scala
import org.apache.spark.sql.functions._

4. Job Failures Due to Incompatible Java JARs or Missing Dependencies

Symptoms:

  • Error: “ClassNotFoundException: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe”
  • Error: “java.lang.NoClassDefFoundError: org.apache.spark.sql.delta.DeltaLog”
  • Spark jobs fail when interacting with Delta Lake, Hadoop, or JDBC drivers

Causes:

  • Incorrect JAR versions conflicting with Spark runtime
  • Missing JAR files in classpath
  • Conflicting versions of delta-core or hadoop-common JARs

Fix:
Install the correct JAR version by attaching its Maven coordinates as a cluster library (Compute → Libraries → Install new → Maven):

io.delta:delta-core_2.12:1.2.0

Ensure JARs are installed at the cluster level for shared execution
For custom JARs, upload them to DBFS and either attach them via the cluster's Libraries tab or copy them into the driver's jar directory from a cluster init script (copying after the JVM has started has no effect on the running classpath):

dbutils.fs.cp("dbfs:/FileStore/libs/mycustom.jar", "file:/databricks/jars/")
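To spot the "two versions of the same artifact" situation described above, one can scan a jar directory for duplicates; find_jar_conflicts is a hypothetical helper that relies on the conventional name-version.jar naming scheme:

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

def find_jar_conflicts(jar_dir):
    """Group filenames like delta-core_2.12-1.2.0.jar by artifact name
    and report artifacts that are present in more than one version."""
    versions = defaultdict(set)
    for jar in Path(jar_dir).glob("*.jar"):
        m = re.match(r"(.+?)-(\d[\w.]*)\.jar$", jar.name)
        if m:
            versions[m.group(1)].add(m.group(2))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}

# Demo against a throwaway directory (on a cluster you would point this
# at /databricks/jars or your DBFS library folder instead):
tmp = Path(tempfile.mkdtemp())
for name in ("delta-core_2.12-1.2.0.jar",
             "delta-core_2.12-2.1.0.jar",
             "hadoop-common-3.3.1.jar"):
    (tmp / name).touch()
conflicts = find_jar_conflicts(tmp)
```

Any artifact listed with two versions is a likely source of NoClassDefFoundError-style failures.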

5. PyPI Package Installations Failing Due to Missing System Dependencies

Symptoms:

  • Error: “OSError: libGL.so.1: cannot open shared object file”
  • Error: “Failed building wheel for XYZ”
  • Job fails but works fine in local Python execution

Causes:

  • Some Python packages require underlying OS dependencies (C++, OpenCV, TensorFlow)
  • System packages must be installed via %sh apt-get or a cluster init script; pip alone cannot provide them

Fix:
Use Databricks ML Runtime if using deep learning libraries
For complex dependencies, install via conda instead of pip

%sh
conda install -y -c conda-forge opencv
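To confirm whether a missing native library (such as libGL.so.1) is really the culprit, the standard library's ctypes.util.find_library can probe the dynamic linker; missing_shared_libs is a hypothetical wrapper:

```python
from ctypes.util import find_library

def missing_shared_libs(names):
    """Return the subset of native library names (without the 'lib'
    prefix) that the dynamic linker cannot locate, e.g. 'GL' for
    libGL.so.1."""
    return [n for n in names if find_library(n) is None]

# e.g. missing_shared_libs(["GL"]) on a cluster node shows whether
# libGL is present before attempting to import OpenCV.
missing = missing_shared_libs(["definitely-not-a-real-lib-xyz"])
```

Running this check before the failing import gives a precise list of OS dependencies to add to an init script.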

Step-by-Step Troubleshooting Guide

1. Check Installed Libraries

%pip list

2. Verify If Any Dependency Conflicts Exist

!pip check

3. Check Cluster Logs for Library Installation Errors

  • Go to Databricks UI → Clusters → Libraries → Event Log

4. Debug Library Paths for JAR Issues

import sys
print(sys.path)

5. Test Library Installation in an Interactive Notebook


import pandas as pd
print(pd.__version__)

Best Practices to Prevent Library-Related Job Failures

Use Cluster Libraries Instead of Notebook %pip install for Jobs

  • Notebook-scoped %pip install does not persist across job runs.
  • Install at cluster-level for jobs to ensure consistent availability.

Use requirements.txt or Conda for Dependency Management

name: myenv
dependencies:
  - python=3.9
  - pandas=1.3.3
  - numpy=1.21
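The same pins can also be expressed as a requirements.txt for pip-based installs (the versions here mirror the conda example above):

```text
pandas==1.3.3
numpy==1.21
```

Checking this file into version control alongside the job keeps every run reproducible.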

Use Databricks ML Runtime for Machine Learning Dependencies

  • Avoid installing large ML libraries manually (tensorflow, pytorch).
  • Use Databricks ML Runtimes that come pre-installed with ML packages.

Monitor Job and Library Logs

  • Set up Databricks Alerts for failed installations.
  • Monitor DBFS logs for missing dependency issues.

Real-World Example: Fixing a Job Failure Due to Pandas Version Conflict

Scenario:

A Databricks job running an ETL pipeline failed with a pandas version mismatch.

Root Cause:

  • The Databricks runtime used pandas 1.2, but the job required pandas 1.3.
  • A mix of cluster-scoped and notebook-scoped installations caused conflicts.

Solution:

  1. Uninstall the existing version and reinstall the correct one:
%pip uninstall pandas -y
%pip install pandas==1.3.3
  2. Update the job environment to pin the correct version.

Impact:

  • The ETL job ran successfully with consistent library versions.

Conclusion

Library-related job failures in Databricks often stem from dependency conflicts, network issues, missing system packages, or incompatible JARs. By ensuring proper package management, leveraging Databricks ML Runtimes, and using cluster-level installations, teams can prevent failures and maintain stable job execution.
