Mohammad Gufran Jahangir · September 29, 2025

Benchmarking SQL/Delta workloads, common pitfalls, and a practical migration checklist


Why Photon Exists

Databricks introduced Photon, a vectorized query engine built in C++ and tightly integrated with Delta Lake and Spark SQL, to solve a key problem:

  • The JVM-based execution engine (non-Photon) is powerful but has overhead.
  • SQL and Delta workloads demand low-latency, high-throughput performance at scale.

Photon takes advantage of modern CPU architectures (SIMD, cache efficiency), leading to significant performance boosts—often 2–3x faster for SQL and Delta workloads, with lower compute cost.

But the real question for admins is: When does it matter to enable Photon, and when should you stick with non-Photon?


Photon vs. Non-Photon: Benchmark Insights

Let’s break it down by workload type.

🔹 1. Delta Lake ETL / Batch Jobs

  • Photon Wins When:
    • Heavy aggregations (groupBy, sum, count).
    • Large Delta tables with billions of rows.
    • Frequent OPTIMIZE + ZORDER.
  • Non-Photon May Suffice:
    • Lightweight transformations.
    • Small/medium datasets where shuffle cost dominates.

👉 Impact: 20–40% faster ETL pipelines, especially when jobs are aggregation-heavy.
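
As a sketch, the aggregation-heavy pattern that benefits most looks like this (the `sales_events` table and its columns are hypothetical placeholders):

```sql
-- Hypothetical large Delta table: sales_events
-- Wide scan + groupBy aggregation: the pattern Photon vectorizes well
SELECT region,
       product_id,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_revenue
FROM   sales_events
WHERE  event_date >= '2025-01-01'
GROUP  BY region, product_id;

-- Periodic compaction + clustering keeps scans Photon-friendly
OPTIMIZE sales_events
ZORDER BY (event_date, region);
```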


🔹 2. Interactive SQL / BI Dashboards

  • Photon Wins When:
    • Ad-hoc analytics with concurrent users.
    • Query patterns with joins + filters.
    • Dashboards powered by Databricks SQL Warehouses.
  • Non-Photon May Suffice:
    • Simple lookups or narrow scans.
    • Developer sandboxes.

👉 Impact: Latency drops from seconds to sub-second, making BI dashboards snappy.
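
A typical dashboard query with the join + filter shape Photon accelerates might look like this (table and column names are illustrative, not from the source):

```sql
-- Hypothetical dashboard query: join + filter + aggregate over Delta tables
SELECT c.segment,
       DATE_TRUNC('week', o.order_ts) AS week,
       SUM(o.amount)                  AS weekly_revenue
FROM   orders o
JOIN   customers c ON o.customer_id = c.customer_id
WHERE  o.order_ts >= DATE_SUB(CURRENT_DATE(), 90)
GROUP  BY c.segment, DATE_TRUNC('week', o.order_ts);
```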


🔹 3. Machine Learning / Feature Engineering

  • Photon Wins When:
    • Feature stores rely on SQL/Delta transformations.
    • Large joins between historical features and event tables.
  • Non-Photon May Suffice:
    • Model training workloads (Python/MLlib/TensorFlow not affected).
    • UDF-heavy pipelines (Photon doesn’t accelerate custom Python logic).

👉 Impact: Faster feature prep, but no impact on training speed.


Common Pitfalls When Using Photon

Even though Photon is powerful, migrating isn’t just a matter of flipping a switch. Here are pitfalls to watch for:

  1. Unsupported Functions / UDFs
    • Photon accelerates native SQL operators, but Python UDFs fall back to row-at-a-time execution outside Photon, limiting speedups.
    • Solution: Rewrite UDFs as SQL expressions or pandas UDFs where possible.
  2. Small Data Doesn’t Benefit
    • Photon shines with large datasets. For small tables (<1M rows), startup overhead may erase benefits.
  3. Cost vs. Performance Trade-Off
    • Photon clusters often deliver more queries per $—but only if utilization is high.
    • Idle Photon clusters = wasted money.
  4. Incompatible Runtimes
    • Ensure you’re on DBR 9.1 LTS or later (Photon is enabled by default in DBSQL Warehouses).
    • Mixed runtimes (old clusters + Photon SQL) can cause inconsistent performance.

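To illustrate pitfall 1: suppose a pipeline uses a Python UDF such as `categorize(price)` (a hypothetical example) to bucket prices. Expressing the same logic as a native SQL expression keeps the whole operator inside Photon:

```sql
-- Instead of a Python UDF like categorize(price), which runs outside
-- Photon, express the logic as a native CASE expression Photon can vectorize:
SELECT order_id,
       CASE
         WHEN price < 10  THEN 'low'
         WHEN price < 100 THEN 'mid'
         ELSE                  'high'
       END AS price_band
FROM   orders;
```

Where logic genuinely needs Python, a pandas UDF is usually the better fallback: it still runs outside Photon, but processes batches instead of single rows.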
Migration Checklist: Moving to Photon

Here’s a step-by-step playbook you can apply in your environment:

  1. Identify Candidate Workloads
    • Run audit on top 10 slowest SQL queries / ETL pipelines.
    • Prioritize Delta + SQL-heavy workloads.
  2. Enable Photon on Test Cluster
    • For clusters: turn on Photon acceleration.
    • For DBSQL: Photon is enabled by default (Pro/Serverless).
  3. Benchmark Side-by-Side
    • Measure query latency, job runtime, shuffle read/write, and cost per run.
    • Target workloads that show >30% improvement.
  4. Check Query Compatibility
    • Validate queries don’t rely heavily on non-supported UDFs.
    • Optimize table layouts (OPTIMIZE + ZORDER before benchmarking).
  5. Roll Out in Phases
    • Start with dev → staging → prod.
    • Monitor query response time, concurrency, and cost metrics.
  6. Govern with Policies
    • Apply cluster policies to enforce Photon for BI/SQL.
    • Keep non-Photon for workloads where it’s unnecessary (e.g., ML training).
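
For step 6, a cluster policy fragment along these lines can pin the runtime engine to Photon for BI/SQL clusters (the `spark_version` values shown are examples; use the Photon runtimes available in your workspace):

```json
{
  "runtime_engine": {
    "type": "fixed",
    "value": "PHOTON"
  },
  "spark_version": {
    "type": "allowlist",
    "values": ["13.3.x-photon-scala2.12", "14.3.x-photon-scala2.12"]
  }
}
```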

Quick Decision Matrix

| Workload Type     | Photon Benefit | When to Use                             | When to Skip                              |
|-------------------|----------------|-----------------------------------------|-------------------------------------------|
| Delta ETL         | High           | Large joins, aggregations, compactions  | Small datasets                            |
| SQL / BI          | Very High      | Dashboards, ad-hoc queries              | Tiny lookup queries                       |
| ML / Feature Eng. | Medium         | Feature prep via SQL                    | Model training, Python UDF-heavy pipelines |

Closing Thoughts

Photon isn’t just a performance upgrade—it’s a cost-optimizer when applied in the right place.

  • For BI/SQL workloads → Always enable Photon.
  • For ETL pipelines → Use Photon if data size and aggregations are large.
  • For ML workloads → Photon accelerates data prep, but not training itself.

The smart path is to benchmark, not assume. By following the migration checklist, you can decide workload by workload where Photon justifies the cost—and where non-Photon is sufficient.


✨ Next Step for You:
Pick one high-cost workload today. Benchmark it with Photon enabled. If runtime drops 30–50%, you’ve found your first win.
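
A minimal side-by-side timing harness might look like this. The `run_query` callables are stand-ins for however you submit the query to a Photon and a non-Photon endpoint; everything here is a sketch, not Databricks API code:

```python
import time
from statistics import median


def benchmark(run_query, warmups=1, runs=5):
    """Time a query function: a few warm-up runs, then report the median."""
    for _ in range(warmups):
        run_query()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return median(timings)


def compare(photon_run, standard_run):
    """Return the speedup of Photon over the standard engine (>1 = faster)."""
    t_photon = benchmark(photon_run)
    t_standard = benchmark(standard_run)
    return t_standard / t_photon


# Stand-in workloads (replace with real query submissions to each endpoint):
fast = lambda: time.sleep(0.01)   # pretend Photon run
slow = lambda: time.sleep(0.03)   # pretend non-Photon run
speedup = compare(fast, slow)
print(f"speedup: {speedup:.1f}x")
```

If the measured speedup clears your threshold (the post suggests 30% as a cutoff), that workload is a Photon candidate.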

