Here are 50 interview questions and answers for an Azure Databricks Platform Engineer, divided by skill level and covering key areas (core, advanced, and scenario-based):
🔹 Essential Level (Core – 1 to 20)
- What is Azure Databricks?
- A unified data analytics platform built on Apache Spark, optimized for Azure cloud for big data and AI workloads.
- How is Azure Databricks different from HDInsight?
- Databricks provides a collaborative workspace with ML and Spark support, whereas HDInsight is a broader Hadoop-based platform.
- What is a cluster in Databricks?
- A set of computation resources (driver and workers) used to run notebooks and jobs.
- Difference between All-Purpose and Job clusters?
- All-Purpose: Interactive development.
- Job: For production jobs triggered via jobs API or pipeline.
- What is a notebook in Databricks?
- A web-based interface to write and execute code in multiple languages like Python, Scala, SQL, and R.
- What is the role of dbutils?
- Utilities to interact with DBFS, secrets, widgets, etc.
- What is Unity Catalog?
- Centralized governance for data access, lineage, audit, and discovery across workspaces.
- Explain the Databricks file system (DBFS).
- A distributed file system over object storage in Azure (ADLS/Blob).
- What is a Delta Lake?
- A storage layer enabling ACID transactions on top of Data Lakes.
- What is a Spark job vs. stage vs. task?
- Job = the work triggered by an action; Stage = a set of tasks between shuffle boundaries; Task = the unit of work executed on one partition.
- What are widgets in Databricks?
- Interactive input controls for notebooks (text, dropdowns, multiselect).
- What is Auto-scaling in Databricks?
- Databricks automatically adds/removes workers based on workload.
- What are the Databricks runtimes (LTS, Photon)?
- Pre-built environments with specific Spark + ML/DL + GPU + performance features.
- Difference between Azure Data Lake and Delta Lake.
- ADLS: Storage only.
- Delta Lake: Storage + transactional layer + schema enforcement.
- What is Lakehouse architecture?
- Combines Data Warehouse (structured, reliable) with Data Lake (scalable, raw).
- How is data versioned in Delta Lake?
- Using transaction logs (the `_delta_log` folder); this also supports time travel.
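As a sketch (the table name is hypothetical), the log-backed version history can be inspected directly in SQL:

```sql
-- Assumes a Delta table named sales.orders (hypothetical)
DESCRIBE HISTORY sales.orders;   -- lists versions recorded in _delta_log
SELECT * FROM sales.orders TIMESTAMP AS OF '2024-01-01';  -- time travel read
```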
- How do you secure notebooks in Databricks?
- Workspace ACLs, Git integration, credential passthrough, and Unity Catalog RBAC.
- What are cluster pools?
- A pool of idle VMs used to reduce cluster start time.
- What is a workspace in Databricks?
- Logical container for notebooks, clusters, jobs, and libraries.
- What is a secret scope in Databricks?
- Secure place to store and access secrets (via Azure Key Vault or Databricks-managed).
🔹 Advanced Level (21 to 35)
- How do you implement CI/CD with Databricks and Azure DevOps?
- Git integration → Repos → YAML pipeline → dbx or CLI deployment → promote to higher envs.
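A minimal Azure DevOps pipeline stage for this flow might look like the following sketch (host, token variables, and paths are assumptions; the legacy `databricks` CLI is used):

```yaml
# Hedged sketch of one deployment stage (names/paths are illustrative)
trigger:
  branches:
    include: [main]

stages:
  - stage: DeployToTest
    jobs:
      - job: deploy
        pool:
          vmImage: ubuntu-latest
        steps:
          - script: pip install databricks-cli
            displayName: Install Databricks CLI
          - script: databricks workspace import_dir ./notebooks /Repos/test/project --overwrite
            env:
              DATABRICKS_HOST: $(DATABRICKS_HOST)
              DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
            displayName: Deploy notebooks to test workspace
```

Promotion to prod would repeat the stage with prod variables behind an approval gate.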
- What is Photon in Databricks?
- A native, vectorized execution engine written in C++ that accelerates SQL and DataFrame workloads.
- How do you monitor and debug Spark jobs in Databricks?
- Use Spark UI: Jobs tab, Stages, Executors, SQL, Thread Dump analysis.
- What is the function of `VACUUM` in Delta Lake?
- Physically deletes old files no longer referenced in Delta tables.
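A hedged sketch (table name hypothetical; the default retention threshold is 7 days):

```sql
VACUUM sales.orders;                   -- remove unreferenced files past retention
VACUUM sales.orders RETAIN 168 HOURS;  -- explicit retention window
```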
- What is the difference between append, overwrite, and merge in Delta?
- Append: Add data, Overwrite: Replace, Merge: UPSERT.
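The merge (UPSERT) case can be sketched as follows (table and column names are hypothetical):

```sql
MERGE INTO sales.orders AS t
USING updates AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```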
- Explain data lineage in Unity Catalog.
- Tracks data flow from source to transformations to output (tables, views, notebooks).
- How do you enforce data governance in Unity Catalog?
- Grant privileges on catalog/schema/table level; use RBAC, tags, audit logs.
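A minimal sketch of catalog/schema/table grants (principal and object names are assumptions):

```sql
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;
SHOW GRANTS ON TABLE main.sales.orders;  -- verify effective privileges
```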
- What is the difference between Azure SQL Database and Synapse Analytics?
- SQL DB: OLTP.
- Synapse: OLAP with MPP for big data analytics.
- How do you optimize cost in Databricks?
- Auto-terminate idle clusters, use job clusters, enable Photon, review cluster sizing, use pools.
- How does Delta Lake handle schema evolution?
- Using the `mergeSchema` or `overwriteSchema` options while writing.
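On the SQL side, automatic schema evolution for `MERGE`/`INSERT` can be enabled at the session level (a sketch; the per-write PySpark equivalent is `df.write.option("mergeSchema", "true")`):

```sql
-- Hedged sketch: session-level automatic schema evolution
SET spark.databricks.delta.schema.autoMerge.enabled = true;
```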
- Explain checkpointing in structured streaming.
- Saves state to resume processing after failure.
- How to troubleshoot skewed stages in Spark?
- Check stage time in Spark UI, look for uneven tasks, optimize joins or repartition.
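One common skew mitigation, key salting, can be sketched in plain Python (the Spark join itself is omitted; function and key names are illustrative):

```python
import random

def salt_key(key: str, n_buckets: int = 8) -> str:
    """Large side of the join: spread a hot key across n_buckets synthetic keys."""
    return f"{key}_{random.randrange(n_buckets)}"

def explode_for_join(key: str, n_buckets: int = 8) -> list:
    """Small side of the join: replicate each key once per bucket."""
    return [f"{key}_{i}" for i in range(n_buckets)]
```

Each salted large-side key then matches exactly one replicated small-side key, so the hot key's rows land in several tasks instead of one.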
- How to connect Databricks with Azure Data Factory?
- Using the Databricks Notebook activity (or a Web activity calling the Jobs API) with token-based authentication in the linked service.
- Explain workspace-level vs. Unity Catalog-level security.
- Workspace: ACL-based permissions on workspace objects (notebooks, clusters, jobs).
- UC: catalog/schema/table-level access using RBAC.
- Difference between Delta Sharing and traditional APIs.
- Delta Sharing is an open protocol to securely share live data without copying it; traditional API-based sharing requires extracts, copies, and custom integration per consumer.
🔹 Scenario-Based (36 to 50)
- How would you set up a dev → test → prod pipeline for notebooks?
- Use Git repos, environment-specific config, CI/CD YAML pipelines, and approval gates.
- A Spark job is taking 2 hours. How do you debug it?
- Use Spark UI → Job tab → Identify long stages → Check executor skew, GC time, or wide transformations.
- Your Delta table queries are slow. What do you check?
- Check Z-ordering, data size, partitions, and whether Delta caching is used.
- How would you manage secrets in multiple environments?
- Create environment-specific secret scopes, integrate with Azure Key Vault.
- If a user sees a permission denied on a table, what do you do?
- Check Unity Catalog grants using `SHOW GRANTS` or the UI access controls.
- You have hundreds of notebooks in Dev. How do you promote only a few to Prod?
- Use branch-based Git workflow or folder-specific release pipelines.
- How do you automate cluster lifecycle for jobs?
- Define job clusters in job JSON config or use cluster pools.
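A sketch of the `new_cluster` block in a Jobs API definition (names, node type, and runtime version are assumptions):

```json
{
  "name": "nightly-etl",
  "tasks": [
    {
      "task_key": "etl",
      "notebook_task": { "notebook_path": "/Repos/prod/etl/main" },
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": { "min_workers": 2, "max_workers": 8 }
      }
    }
  ]
}
```

The cluster is created when the job starts and terminated when it finishes, so nothing sits idle between runs.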
- Explain a use case where you’d prefer a job cluster over an all-purpose one.
- For scheduled nightly ETL jobs that don’t need interactivity.
- What’s your backup strategy for Delta tables?
- Use checkpointing and versioning (time travel); optionally export to ADLS.
- How do you audit user activity in Databricks?
- Use `system.access.audit` tables or Azure Monitor diagnostic logs.
- A user mistakenly deletes a table. How do you recover?
- Use Delta Lake time travel: `VERSION AS OF` or `TIMESTAMP AS OF`.
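A hedged recovery sketch (table name and version are hypothetical):

```sql
SELECT * FROM sales.orders VERSION AS OF 42;    -- inspect an earlier version
RESTORE TABLE sales.orders TO VERSION AS OF 42; -- roll the table back
-- A table that was DROPped (not just overwritten) in Unity Catalog
-- may instead need UNDROP TABLE within the retention window.
```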
- Explain the benefits of Z-Ordering in Delta Lake.
- Optimizes file skipping for faster queries by co-locating related data.
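As a sketch (table and column names hypothetical), Z-ordering is applied during `OPTIMIZE`:

```sql
OPTIMIZE sales.orders ZORDER BY (customer_id, order_date);
```

Columns frequently used in filters are the usual candidates, since co-locating their values maximizes file skipping.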
- You need to run ML jobs nightly. What’s your setup?
- Schedule notebook using Job API on a job cluster with required libraries.
- What happens when two jobs write to the same Delta table?
- Delta uses optimistic concurrency; one job may fail if conflict occurs.
- Your workspace has too many idle clusters. How do you clean it up?
- Enforce cluster auto-termination, monitor via audit logs, tag clusters for ownership.