Mohammad Gufran Jahangir August 4, 2025

Here are 50 interview questions and answers for an Azure Databricks Platform Engineer, organized by skill level: core essentials, advanced topics, and scenario-based questions.


🔹 Essential Level (Core – 1 to 20)

  1. What is Azure Databricks?
    • A unified data analytics platform built on Apache Spark, optimized for Azure cloud for big data and AI workloads.
  2. How is Azure Databricks different from HDInsight?
    • Databricks provides a collaborative workspace with ML and Spark support, whereas HDInsight is a broader Hadoop-based platform.
  3. What is a cluster in Databricks?
    • A set of computation resources (driver and workers) used to run notebooks and jobs.
  4. Difference between All-Purpose and Job clusters?
    • All-Purpose: interactive development and collaboration.
    • Job: ephemeral clusters for production jobs triggered via the Jobs API or a pipeline; terminated when the run ends.
  5. What is a notebook in Databricks?
    • A web-based interface to write and execute code in multiple languages like Python, Scala, SQL, and R.
  6. What is the role of dbutils?
    • Utilities to interact with DBFS, secrets, widgets, etc.
  7. What is Unity Catalog?
    • Centralized governance for data access, lineage, audit, and discovery across workspaces.
  8. Explain the Databricks file system (DBFS).
    • A distributed file system over object storage in Azure (ADLS/Blob).
  9. What is Delta Lake?
    • A storage layer enabling ACID transactions on top of Data Lakes.
  10. What is a Spark job vs. stage vs. task?
    • Job = all the work triggered by one action; Stage = a set of tasks between shuffle boundaries; Task = the smallest unit of execution, one per partition.
  11. What are widgets in Databricks?
    • Interactive input controls for notebooks (text, dropdowns, multiselect).
  12. What is Auto-scaling in Databricks?
    • Databricks automatically adds/removes workers based on workload.
  13. What are the Databricks runtimes (LTS, Photon)?
    • Pre-built images bundling Spark with pinned library versions; variants include LTS (long-term support), ML (with ML/DL and GPU libraries), and Photon-enabled runtimes for performance.
  14. Difference between Azure Data Lake and Delta Lake.
    • ADLS: Storage only.
    • Delta Lake: Storage + transactional layer + schema enforcement.
  15. What is Lakehouse architecture?
    • Combines Data Warehouse (structured, reliable) with Data Lake (scalable, raw).
  16. How is data versioned in Delta Lake?
    • Using transaction logs (_delta_log folder), supports time travel.
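The commit-log mechanism behind time travel can be illustrated with a toy model in plain Python. This is a simplified sketch, not the real Delta implementation (the actual _delta_log also records schema, statistics, and protocol actions):

```python
import json

class ToyDeltaLog:
    """Toy model of a Delta _delta_log: each commit is a JSON entry
    recording files added/removed; replaying commits up to version N
    reconstructs the table as of that version (time travel)."""

    def __init__(self):
        self.commits = []  # commit i = table version i

    def commit(self, add=(), remove=()):
        self.commits.append(json.dumps({"add": list(add), "remove": list(remove)}))
        return len(self.commits) - 1  # version number of this commit

    def files_as_of(self, version):
        files = set()
        for entry in self.commits[: version + 1]:
            action = json.loads(entry)
            files |= set(action["add"])
            files -= set(action["remove"])
        return files

log = ToyDeltaLog()
log.commit(add=["part-0.parquet"])                            # version 0
log.commit(add=["part-1.parquet"])                            # version 1
log.commit(add=["part-2.parquet"], remove=["part-0.parquet"]) # version 2

print(log.files_as_of(0))  # {'part-0.parquet'}
print(log.files_as_of(2))  # current snapshot
```

Note that time travel only works while the old files still exist on storage; once VACUUM removes them, versions older than the retention window can no longer be read.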
  17. How do you secure notebooks in Databricks?
    • Workspace ACLs, Git integration, credential passthrough, and Unity Catalog RBAC.
  18. What are cluster pools?
    • A pool of idle VMs used to reduce cluster start time.
  19. What is a workspace in Databricks?
    • Logical container for notebooks, clusters, jobs, and libraries.
  20. What is a secret scope in Databricks?
    • Secure place to store and access secrets (via Azure Key Vault or Databricks-managed).

🔹 Advanced Level (21 to 35)

  21. How do you implement CI/CD with Databricks and Azure DevOps?
    • Git integration → Repos → YAML pipeline → deploy via the Databricks CLI, dbx, or Asset Bundles → promote to higher environments.
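A minimal Azure DevOps pipeline sketch for the deployment step. Paths, variable names, and the target workspace folder are assumptions to adapt to your setup; this uses the legacy Databricks CLI's `workspace import_dir` command, while newer setups would typically use Databricks Asset Bundles instead:

```yaml
# azure-pipelines.yml — illustrative only; adapt stages, paths, and variables
trigger:
  branches:
    include: [main]

stages:
  - stage: Deploy_Test
    jobs:
      - job: deploy
        steps:
          - script: pip install databricks-cli
            displayName: Install Databricks CLI
          - script: |
              databricks workspace import_dir notebooks /Shared/etl --overwrite
            env:
              DATABRICKS_HOST: $(TEST_DATABRICKS_HOST)
              DATABRICKS_TOKEN: $(TEST_DATABRICKS_TOKEN)
            displayName: Deploy notebooks to test workspace
```

A prod stage would repeat the same steps against prod host/token variables, gated behind an environment approval.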
  22. What is Photon in Databricks?
    • A native vectorized execution engine written in C++ that accelerates SQL and DataFrame workloads.
  23. How do you monitor and debug Spark jobs in Databricks?
    • Use Spark UI: Jobs tab, Stages, Executors, SQL, Thread Dump analysis.
  24. What is the function of VACUUM in Delta Lake?
    • Physically deletes data files no longer referenced by the Delta log once they are older than the retention period (default 7 days).
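The retention rule can be sketched as a small Python function. This is a toy model of the decision VACUUM makes per file, not the real implementation:

```python
def vacuum(files, referenced, retention_hours=168):
    """Toy VACUUM: keep a file if it is still referenced by the current
    table version, or if it is newer than the retention window
    (168 hours = 7 days, matching Delta's default)."""
    kept, deleted = [], []
    for name, age_hours in files:
        if name in referenced or age_hours <= retention_hours:
            kept.append(name)
        else:
            deleted.append(name)
    return kept, deleted

files = [("part-0.parquet", 200), ("part-1.parquet", 24), ("part-2.parquet", 300)]
kept, deleted = vacuum(files, referenced={"part-2.parquet"})
print(deleted)  # ['part-0.parquet'] — old and unreferenced
```

The retention window exists precisely so that time travel and in-flight readers keep working; shortening it trades recoverability for storage cost.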
  25. What is the difference between append, overwrite, and merge in Delta?
    • Append: Add data, Overwrite: Replace, Merge: UPSERT.
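The UPSERT semantics of MERGE can be shown with a toy dict-based version in plain Python (a sketch of the matched/not-matched logic, not Delta's actual implementation):

```python
def merge(target, source, key):
    """Toy MERGE (UPSERT): source rows matched on `key` update the
    target row, unmatched source rows are inserted — mirroring
    MERGE INTO ... WHEN MATCHED UPDATE / WHEN NOT MATCHED INSERT."""
    merged = {row[key]: row for row in target}
    for row in source:
        merged[row[key]] = row  # update if matched, insert otherwise
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
source = [{"id": 2, "val": "B"}, {"id": 3, "val": "c"}]
print(merge(target, source, "id"))
# id 1 untouched, id 2 updated, id 3 inserted
```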
  26. Explain data lineage in Unity Catalog.
    • Tracks data flow from source to transformations to output (tables, views, notebooks).
  27. How do you enforce data governance in Unity Catalog?
    • Grant privileges on catalog/schema/table level; use RBAC, tags, audit logs.
  28. What is the difference between Azure SQL Database and Synapse Analytics?
    • SQL DB: OLTP.
    • Synapse: OLAP with MPP for big data analytics.
  29. How do you optimize cost in Databricks?
    • Auto-terminate idle clusters, use job clusters, enable Photon, review cluster sizing, use pools.
  30. How does Delta Lake handle schema evolution?
    • Using mergeSchema or overwriteSchema options while writing.
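The column-addition behavior of mergeSchema can be sketched as a toy schema merge in plain Python (a simplified model: real Delta also handles nested fields and type widening rules):

```python
def merge_schema(existing, incoming):
    """Toy mergeSchema: new columns from the incoming batch are added
    to the table schema; an existing column's type must match,
    otherwise the write is rejected (as Delta does for incompatible
    type changes)."""
    merged = dict(existing)
    for col, dtype in incoming.items():
        if col in merged and merged[col] != dtype:
            raise ValueError(f"incompatible type for column {col}")
        merged[col] = dtype
    return merged

table = {"id": "int", "name": "string"}
batch = {"id": "int", "email": "string"}
print(merge_schema(table, batch))  # 'email' column added to the schema
```

In actual usage this corresponds to writing with `.option("mergeSchema", "true")`; `overwriteSchema` instead replaces the schema entirely on an overwrite.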
  31. Explain checkpointing in structured streaming.
    • Persists stream progress (offsets and state) to durable storage so a failed query can resume exactly where it left off.
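The resume-from-offset behavior can be demonstrated with a toy micro-batch loop in plain Python (a sketch only; real Structured Streaming checkpoints offsets, state, and metadata per trigger):

```python
import json
import os
import tempfile

def process_stream(events, checkpoint_path):
    """Toy checkpointing: persist the last processed offset after each
    record; on restart, resume from the saved offset instead of
    reprocessing the whole stream."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["offset"]
    processed = []
    for offset in range(start, len(events)):
        processed.append(events[offset])
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset + 1}, f)
    return processed

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
print(process_stream(["e1", "e2"], ckpt))        # first run: ['e1', 'e2']
print(process_stream(["e1", "e2", "e3"], ckpt))  # after "restart": ['e3']
```

This is why the checkpoint location must live on durable storage (e.g. ADLS) and must not be shared between different queries.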
  32. How do you troubleshoot skewed stages in Spark?
    • Check stage times in the Spark UI, look for a few long-running tasks, then fix with broadcast joins, key salting, or repartitioning.
  33. How do you connect Databricks with Azure Data Factory?
    • Using ADF's native Databricks Notebook/Jar/Python activities through a Databricks linked service (access token or managed identity authentication).
  34. Explain workspace-level vs. Unity Catalog-level security.
    • Workspace: ACL-based
    • UC: Catalog/schema/table level using RBAC.
  35. Difference between Delta Sharing and traditional APIs.
    • Delta Sharing is an open protocol for securely sharing live data without copying it; traditional APIs require extracting and replicating data for each consumer.

🔹 Scenario-Based (36 to 50)

  36. How would you set up a dev → test → prod pipeline for notebooks?
    • Use Git repos, environment-specific config, CI/CD YAML pipelines, and approval gates.
  37. A Spark job is taking 2 hours. How do you debug it?
    • Use Spark UI → Job tab → Identify long stages → Check executor skew, GC time, or wide transformations.
  38. Your Delta table queries are slow. What do you check?
    • Check file sizes (run OPTIMIZE to compact small files), Z-ordering, partitioning strategy, and whether Delta caching is used.
  39. How would you manage secrets in multiple environments?
    • Create environment-specific secret scopes, integrate with Azure Key Vault.
  40. If a user gets a permission-denied error on a table, what do you do?
    • Check Unity Catalog grants using SHOW GRANTS or UI access control.
  41. You have hundreds of notebooks in Dev. How do you promote only a few to Prod?
    • Use branch-based Git workflow or folder-specific release pipelines.
  42. How do you automate cluster lifecycle for jobs?
    • Define job clusters in job JSON config or use cluster pools.
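The shape of such a job definition (Jobs API 2.1 style) can be sketched as a Python dict. The field names are real API fields, but the values — node type, runtime version, notebook path, schedule — are placeholders to adapt to your environment:

```python
import json

# Sketch of a Jobs API job definition with an ephemeral job cluster.
# Values (node type, versions, paths, cron) are placeholders.
job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Shared/etl/nightly"},
            "new_cluster": {  # created at run start, deleted when the run ends
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "autoscale": {"min_workers": 2, "max_workers": 8},
            },
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 daily
        "timezone_id": "UTC",
    },
}
print(json.dumps(job_spec, indent=2))
```

Because `new_cluster` is defined inline, the cluster's whole lifecycle is tied to the run; swapping in an `instance_pool_id` would draw the VMs from a pool to cut startup time further.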
  43. Explain a use case where you’d prefer a job cluster over an all-purpose one.
    • For scheduled nightly ETL jobs that don’t need interactivity.
  44. What’s your backup strategy for Delta tables?
    • Rely on Delta versioning (time travel) for short-term recovery; for longer-term backups, use DEEP CLONE or export snapshots to a separate ADLS location.
  45. How do you audit user activity in Databricks?
    • Use system.access.audit tables or Azure Monitor diagnostic logs.
  46. A user mistakenly deletes a table. How do you recover?
    • Query an earlier version with time travel (VERSION AS OF / TIMESTAMP AS OF), or roll the table back with RESTORE TABLE.
  47. Explain the benefits of Z-Ordering in Delta Lake.
    • Optimizes file skipping for faster queries by co-locating related data.
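The file-skipping effect can be shown with a toy model of Delta's per-file min/max statistics in plain Python (a simplified sketch; real Delta keeps stats for the first 32 columns by default):

```python
def files_to_scan(file_stats, column_value):
    """Toy file skipping: a point query only reads files whose
    [min, max] range for the column can contain the value."""
    return [name for name, (lo, hi) in file_stats.items() if lo <= column_value <= hi]

# After Z-ordering on `id`, per-file ranges are narrow and mostly disjoint:
zordered = {"f1": (0, 99), "f2": (100, 199), "f3": (200, 299)}
# Without clustering, every file spans the whole value range:
unclustered = {"f1": (0, 299), "f2": (0, 299), "f3": (0, 299)}

print(files_to_scan(zordered, 150))     # ['f2'] — 1 of 3 files read
print(files_to_scan(unclustered, 150))  # all 3 files read
```

This is why Z-ordering helps most on high-cardinality columns that appear in selective filters: the tighter and more disjoint the per-file ranges, the more files can be skipped.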
  48. You need to run ML jobs nightly. What’s your setup?
    • Schedule the notebook via the Jobs API (or UI) on a job cluster with the required libraries installed.
  49. What happens when two jobs write to the same Delta table?
    • Delta uses optimistic concurrency; one job may fail if conflict occurs.
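The optimistic-concurrency protocol can be sketched with a toy version check in plain Python (a simplified model of the commit path, not Delta's actual conflict detection):

```python
class ToyDeltaTable:
    """Toy optimistic concurrency: a writer records the version it
    read; its commit succeeds only if no one else committed in
    between, otherwise it raises a conflict."""

    def __init__(self):
        self.version = 0

    def commit(self, read_version):
        if read_version != self.version:
            raise RuntimeError("ConcurrentModificationException: retry needed")
        self.version += 1
        return self.version

table = ToyDeltaTable()
v = table.version                  # both jobs read version 0
table.commit(read_version=v)       # job A commits -> version 1
try:
    table.commit(read_version=v)   # job B's snapshot is stale -> conflict
except RuntimeError as e:
    print(e)
```

In practice Delta automatically retries commits whose changes do not actually conflict (e.g. blind appends); only true conflicts such as concurrent updates to the same files surface as exceptions.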
  50. Your workspace has too many idle clusters. How do you clean it up?
    • Enforce cluster auto-termination, monitor via audit logs, tag clusters for ownership.
