Mohammad Gufran Jahangir, February 16, 2026

Quick Definition

Apache Pulsar is a distributed, cloud-native pub/sub messaging and streaming platform offering persistent, durable topics with partitioning and geo-replication. Analogy: Pulsar is like a global postal system with guaranteed delivery, routing, and regional offices. Formally: a multi-tenant, topic-centric messaging system with decoupled storage and serving layers.


What is Pulsar?

What it is:

  • A distributed messaging and streaming platform supporting pub/sub, streaming, and queue semantics with persistent storage and at-least-once delivery guarantees by default.
  • Multi-tenancy, topic-level isolation, and built-in geo-replication are core design features.

What it is NOT:

  • Not just a lightweight broker; it includes a distributed storage layer and a BookKeeper-based persistence model.

  • Not purely an event bus; it is a full streaming platform with replay, partitioning, and retention controls.

Key properties and constraints:

  • Decoupled serving and storage: brokers handle routing and client connections while a separate storage layer persists messages.
  • Partitioned topics for parallelism and ordering per partition.
  • Native support for geo-replication and tenant isolation.
  • Strong consistency within partitions for ordered delivery with configurable delivery semantics.
  • Operational complexity: requires orchestration and storage management for production-grade deployments.
  • Latency and throughput trade-offs depend on deployment topology and storage configuration.

Where it fits in modern cloud/SRE workflows:

  • Ingest layer for event-driven architectures and data pipelines.
  • Buffering and durable message transport between microservices, analytics, and downstream sinks.
  • Replacement for legacy message queues when persistence, replay, or multi-region replication is required.
  • Integrates with Kubernetes for cloud-native deployments and with serverless functions for event-driven compute.

Text-only diagram description (visualize):

  • Imagine three layers stacked vertically: a top layer of clients (producers and consumers) connects to a middle layer of stateless brokers. The brokers route messages and forward persistence operations to the bottom layer of bookie nodes that store segments. A metadata store holds topic and ledger metadata, and cluster management components coordinate replication and lifecycle. Geo-replication copies ledgers across regions asynchronously. Monitoring and operator tooling sit alongside, interacting through APIs and the control plane.

Pulsar in one sentence

A cloud-native, multi-tenant streaming and messaging platform that decouples serving and storage to provide durable, replayable, and geo-replicated event delivery.

Pulsar vs related terms

| ID | Term | How it differs from Pulsar | Common confusion |
|----|------|----------------------------|------------------|
| T1 | Kafka | Broker-focused cluster with log storage inside the brokers | Both are streaming systems |
| T2 | RabbitMQ | Broker with complex routing and in-memory focus | Often used for short-lived messages |
| T3 | BookKeeper | Storage layer used by Pulsar | BookKeeper is storage, not a full messaging system |
| T4 | Event Mesh | Architectural pattern spanning networks | Pulsar can implement this but is a product |
| T5 | Stream Processing | Compute over streams | Pulsar provides streams, not necessarily processing |
| T6 | Message Queue | Simple queue semantics | Pulsar supports more features and persistence |
| T7 | Service Bus | Enterprise integration suite | Pulsar focuses on events, not full ESB features |
| T8 | Kinesis | Managed AWS streaming service | Kinesis is cloud managed; Pulsar can be self-hosted |
| T9 | Managed Pulsar | Cloud-hosted Pulsar offering | Implementation and SLA vary by vendor |
| T10 | Schema Registry | Type management for events | Pulsar includes schema support natively |


Why does Pulsar matter?

Business impact:

  • Revenue: Durable, high-throughput messaging reduces lost events and supports real-time revenue features like personalization or trading.
  • Trust: Geo-replication and persistence lower data loss risk and contractual SLA exposure.
  • Risk: Complex operations can increase vendor or ops risk if not automated.

Engineering impact:

  • Incident reduction: Proper isolation and retention can reduce production data loss incidents.
  • Velocity: Developers build event-driven features faster when replay and durability are available.
  • Complexity adds overhead: introduces storage management and cross-region consistency tasks.

SRE framing:

  • SLIs/SLOs: Throughput, publish latency, consume lag, write durability, and replication success rate are primary SLIs.
  • Error budgets: Use delivery failure rates and replication lag to define burn rates.
  • Toil/on-call: Operational automation around broker scaling, ledger compaction, and bookie replacements reduces toil.

Realistic “what breaks in production” examples:

  1. Broker crash causing increased consumer reconnects and partition leadership churn.
  2. Bookie disk failure causing ledger under-replicated states and slow recovery.
  3. Misconfigured retention leading to unexpected data deletion and failed replays.
  4. Network partition between regions causing geo-replication backlog and increased latency.
  5. Tenant or topic hot-spot causing broker CPU exhaustion and downstream pipeline slowdown.

Where is Pulsar used?

| ID | Layer/Area | How Pulsar appears | Typical telemetry | Common tools |
|----|------------|--------------------|-------------------|--------------|
| L1 | Edge/Ingress | Ingest buffer for telemetry and events | Ingest rate and latency | See details below: L1 |
| L2 | Network/Transport | Message routing and fan-out | Broker CPU and network | Prometheus, Grafana |
| L3 | Service/Application | Event bus between microservices | Consumer lag and errors | OpenTelemetry |
| L4 | Data/Analytics | Streaming source for pipelines | Throughput and retention | Flink, Spark |
| L5 | Cloud/Kubernetes | Deployed as stateful services | Pod restarts and PVC usage | Helm, Operators |
| L6 | Serverless/PaaS | Trigger for functions | Invocation latency and retries | Functions platform |
| L7 | CI/CD & Ops | Integration tests and pipelines | Test publish/consume success | CI tools |
| L8 | Observability/Security | Audit events and telemetry | Audit logs and access denials | SIEM, Prometheus |

Row Details:

  • L1: Edge gateways publish bursts from devices, use batching and TLS termination.
  • L6: Serverless platforms subscribe to topics to trigger short-lived compute.

When should you use Pulsar?

When it’s necessary:

  • You need durable, replayable streams with geo-replication across regions.
  • Multi-tenancy and strict isolation between teams or customers are required.
  • High throughput with long retention windows and large-scale partitioning is needed.

When it’s optional:

  • Small-scale pub/sub inside a single region with simple queue semantics.
  • Very low-latency in-memory pub/sub where persistence is not required.

When NOT to use / overuse it:

  • For simple point-to-point RPC; lightweight message brokers may be simpler.
  • When organizational maturity cannot support operating storage and replication at scale.
  • For tiny workloads where managed cloud queues are significantly cheaper.

Decision checklist:

  • If you need durability and replay and expect >100k msgs/s -> use Pulsar.
  • If you need simple ephemeral messaging with minimal ops -> consider SaaS queue.
  • If you need serverless triggers and prefer managed operations -> evaluate managed Pulsar offerings.

Maturity ladder:

  • Beginner: Single-region, few topics, limited partitions, operator-managed.
  • Intermediate: Multiple namespaces, automated Helm/Operator deployments, basic SLOs.
  • Advanced: Multi-region geo-replication, dynamic partitioning, automated scaling, full SRE on-call.

How does Pulsar work?

Components and workflow:

  • Producer: Sends messages to a topic via broker endpoint.
  • Broker: Stateless routing layer that handles client connections and enforces topic policies.
  • Bookkeeper (bookies): Storage nodes that persist ledgers (message segments) across replicas.
  • Metadata store: Stores topic metadata, ledger locations, and cluster configuration.
  • ZooKeeper or metadata service: Historically used for coordination; specifics vary with Pulsar versions (metadata implementations evolved).
  • Consumers: Pull or receive messages with acknowledgment semantics; can replay from offsets.

Data flow and lifecycle:

  1. Producer sends messages to broker.
  2. Broker validates and routes request, assigns ledger/segment.
  3. Broker writes message to bookies; bookies persist data across replicas.
  4. Once write quorum returns, broker acknowledges producer.
  5. Consumers fetch messages; after processing they ACK, enabling retention/compaction policies to free storage.
  6. Data older than the retention window (or superseded by compaction for keyed topics) is removed; ledgers are archived or deleted as configured.
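From the client side, this flow looks roughly like the following. A minimal sketch using the Python pulsar-client library, assuming a broker reachable at pulsar://localhost:6650 and an illustrative topic and subscription name:

```python
# Minimal publish/consume sketch with the Python pulsar-client library.
# Broker URL, topic, and subscription names are placeholder assumptions.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Producer: send() returns once the broker has acknowledged the write,
# i.e. after the bookie write quorum has persisted the message.
producer = client.create_producer('persistent://public/default/orders')
producer.send(b'order-created:1234')

# Consumer: subscribe with a named subscription, process, then ack so the
# broker can advance the subscription cursor and eventually free storage.
consumer = client.subscribe('persistent://public/default/orders', 'analytics-sub')
msg = consumer.receive(timeout_millis=10000)
print('received:', msg.data())
consumer.acknowledge(msg)

client.close()
```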

Edge cases and failure modes:

  • Under-replicated ledgers when bookies fail; need quick re-replication.
  • Broker restarts requiring client reconnection and potential short consumer stalls.
  • Hot partitions causing skewed throughput across consumers.
  • Geo-replication backlog growth during network partitions.

Typical architecture patterns for Pulsar

  • Ingest and Fan-out: Producers publish telemetry; multiple consumer groups process in parallel. Use when many downstream consumers need the same events.
  • Queue/Work Distribution: Use exclusive or shared subscriptions for worker queues with acknowledgement semantics. Use when distributing tasks across workers.
  • Stream Processing with Connectors: Pulsar as source to stream processing engines (Flink) and sinks (object stores). Use for ETL and analytics.
  • Event Sourcing: Store events for domain-driven designs with retention and replay. Use when auditability and replay are required.
  • Geo-Replicated Global Bus: Multiple regions with replication policies for disaster recovery and low-latency reads. Use for global applications requiring redundancy.
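As a concrete illustration of the queue/work-distribution pattern above, here is a minimal sketch with the Python client: several copies of this worker attach to the same Shared subscription so the broker load-balances messages across them. The topic, subscription name, and process() handler are illustrative placeholders.

```python
# Work-distribution sketch: workers share one Shared subscription.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe(
    'persistent://public/default/tasks',
    subscription_name='worker-pool',
    consumer_type=pulsar.ConsumerType.Shared,
)

while True:
    msg = consumer.receive()
    try:
        process(msg.data())          # hypothetical task handler
        consumer.acknowledge(msg)    # task done, removed from the backlog
    except Exception:
        # Negative ack asks the broker to redeliver, possibly to another worker.
        consumer.negative_acknowledge(msg)
```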

Failure modes & mitigation

| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|----|--------------|---------|--------------|------------|----------------------|
| F1 | Broker crash | Clients reconnecting frequently | Broker OOM or CPU spike | Scale brokers and tune the JVM | Broker restarts metric |
| F2 | Bookie disk full | Writes failing or slow | Retention misconfig or disk leak | Add storage or prune retention | Disk usage and write errors |
| F3 | Under-replicated ledgers | Replication count below target | Bookie failure | Trigger re-replication and replace the bookie | Under-replicated ledger count |
| F4 | Geo-replication lag | Remote consumers see old data | Network partition or slow link | Increase replication threads or bandwidth | Replication backlog |
| F5 | Topic hot-spot | High latency on a partition | Uneven partitioning | Repartition the topic or add consumers | Per-partition throughput |
| F6 | Authentication failure | Clients rejected | Misconfigured auth or expired token | Rotate tokens and update configs | Auth failure logs |
| F7 | Metadata corruption | Topics inaccessible | Metadata store failure | Restore metadata backup; prevent writes | Metadata error events |
| F8 | Excessive retention cost | Unexpected storage bills | Long retention with high volume | Implement compaction and TTL | Retained bytes and cost metrics |

Row Details:

  • F3: Under-replicated ledgers can accumulate if recovery is slow. Steps: monitor underreplicated count, replace failed nodes, run rebalance.
  • F4: Replication backlog after long partition can be huge. Steps: throttle producer, increase replication concurrency, apply temporary retention increases.
  • F5: Repartitioning may require downtime or careful topic migration; use partitioned topics and measure consumer lag during migration.

Key Concepts, Keywords & Terminology for Pulsar

Glossary:

Acknowledge — Client confirms message processed — Lets the subscription cursor advance so storage can be reclaimed — Missing ACKs cause redelivery
Auto Topic Creation — Automatic topic provisioning — Simplifies dev experience — Can cause topic proliferation
Backlog — Unconsumed messages stored — Reflects retention and lag — Growing backlog increases storage cost
Broker — Stateless request handler — Routes messages to storage — Broker OOM affects availability
Bookie — Storage node in BookKeeper model — Persists ledgers — Disk failures lead to under-replication
Compaction — Key-based retention optimization — Reduces storage for keyed events — Misusing may lose history
Consumer — Reads messages from topic — Multiple modes exist — Misconfigured mode causes duplicates
Deduplication — Remove duplicate publishes — Ensures idempotence for producers — Requires message ID tracking
Delivery Semantics — At-least-once or effectively-once patterns — Defines client guarantees — Choosing wrong leads to duplicates
Geo-Replication — Replicates topics across regions — Enables DR and locality — Network issues cause backlog
Ledger — Storage segment of messages — Unit of persistence — Corrupt ledgers block topic reads
Namespace — Multi-tenant grouping of topics — Used for isolation and quotas — Misconfiguration exposes resources
Partitioned Topic — Topic split into partitions — Enables parallelism — Too many partitions cause overhead
Producer — Client sending messages — Controls batching and routing — Bad batch settings increase latency
Retention — How long messages are kept — Determines replayability — Long retention increases cost
Schema — Type constraints for messages — Provides compatibility guarantees — Schema evolution needs care
Subscription — Consumer group concept — Shared or exclusive modes — Wrong mode breaks delivery pattern
Subscription Position — Earliest, latest, or timestamp — Controls where consumers start — Misuse causes missing data
Sync Replication — Synchronous replication mode — Stronger durability — Slower writes
Async Replication — Asynchronous replication mode — Faster writes, eventual replication — Risk of data lag
BookKeeper Ensemble — Group of bookies storing ledger replicas — Ensemble size impacts durability — Too small risks data loss
Metadata Store — Stores topic and ledger metadata — Critical for cluster health — Corrupt metadata prevents operations
Znode — Metadata node entry (coordination store) — Used for locking and config — Zookeeper specifics may vary
TTL — Time to live for messages — Automatic deletion threshold — Wrong TTL deletes needed data
Message ID — Unique identifier per message — Needed for ack and replay — Mismanaging breaks dedupe
End-to-end Latency — Time from publish to consume — Key SLI — High variability breaks SLAs
Schema Registry — Manages schemas for topics — Ensures compatibility — Version misalignment breaks deserialization
Backpressure — Flow control under load — Protects brokers and consumers — Ignoring leads to OOM or stalls
Offload — Archive old ledgers to long-term storage — Saves disk space — Offload failures risk data loss
Retention Policy — Rules for storage retention — Controls cost vs replay — Inconsistent policy causes surprises
Consumer Lag — Difference between head and processed offset — Measures freshness — High lag indicates bottleneck
Partition Key — Used to route events to partition — Preserves ordering for a key — Using high-cardinality keys causes hotspots
Subscription Types — Exclusive, shared, failover — Determines consumer behavior — Wrong type alters semantics
Message TTL — Per-message expiry control — Auto-removes stale events — Misuse deletes live data
Load Manager — Balances topics across brokers — Prevents hotspots — Incorrect tuning causes imbalance
Pulsar Functions — Lightweight compute for event processing — Useful for simple transforms — Not a replacement for full stream engines
Connector — Prebuilt source/sink integration — Speeds pipeline building — Wrong connector causes duplication
Auth/Authorization — Security model for access control — Protects topics and namespaces — Weak config risks exposure
TLS — Encryption for client and internode traffic — Required for confidential data — Certificates must be rotated
Metrics Exporter — Exposes Pulsar metrics — Essential for SRE — Missing exporters blind ops
Schema Validation — Runtime message type checking — Prevents consumer errors — Strict validation blocks compatible producers
Consumer Groups — Logical grouping of consumers — Scale reading work — Misuse creates duplicate processing


How to Measure Pulsar (Metrics, SLIs, SLOs)

| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|----|------------|-------------------|----------------|-----------------|---------|
| M1 | Publish Latency | Time the producer waits for an ack | 95th percentile of publish time | <100 ms | Peaks on GC and disk spikes |
| M2 | End-to-End Latency | Time from publish to consumer ack | 95th percentile from publish to consume | <500 ms | Depends on consumer processing |
| M3 | Consumer Lag | Messages behind head per subscription | Consumer lag gauge | <10k msgs or <30 s | High retention inflates the gauge |
| M4 | Under-replicated Ledgers | Count of under-replicated ledgers | Count from metadata | 0 | Can spike after failures |
| M5 | Broker CPU | Broker CPU utilization | Average CPU across brokers | <70% | JVM GC can cause short spikes |
| M6 | Bookie Disk Usage | Storage used per bookie | Percent disk used | <70% | Sudden growth on retention change |
| M7 | Replication Backlog | Messages pending replication | Messages or bytes pending | Near 0 | Network partitions show high backlog |
| M8 | Publish Error Rate | Failed publishes per second | Errors / total publishes | <0.1% | Transient auth issues inflate the rate |
| M9 | Consumer Error Rate | Consumer processing failures | Failures per second | <0.5% | Deserialization errors are a common cause |
| M10 | Broker Restarts | Number of restarts per day | Restart count | 0 | Rolling upgrades may increase the count |
| M11 | Message Loss Incidents | Confirmed lost messages | Post-incident audit | 0 | Hard to detect without end-to-end checks |
| M12 | Schema Rejection Rate | Messages rejected by schema | Rejected count / total | <0.1% | Version mismatches cause rejections |
| M13 | Connection Errors | Failed client connects | Error count | Low | Credential expiry spikes this |
| M14 | Topic Hotspots | Top topics by throughput | Top N list | Identify and redistribute | High cardinality causes hotspots |
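To turn M3-style consumer lag and backlog numbers into something scriptable, the sketch below reads per-subscription backlog from the Pulsar admin REST API's topic stats endpoint. The admin host, tenant/namespace/topic, and the alert threshold are placeholder assumptions; verify the endpoint path and field names against your Pulsar version.

```python
# Hedged sketch: per-subscription backlog (consumer lag in messages) via the
# admin "topic stats" API. Host, topic, and threshold are placeholders.
import requests

ADMIN = 'http://localhost:8080'
TOPIC = 'public/default/orders'   # tenant/namespace/topic

stats = requests.get(f'{ADMIN}/admin/v2/persistent/{TOPIC}/stats', timeout=10).json()
for name, sub in stats.get('subscriptions', {}).items():
    backlog = sub.get('msgBacklog', 0)
    print(f'subscription={name} backlog={backlog} msgs')
    if backlog > 10_000:          # starting target from M3 above
        print('  -> lag above target, investigate consumers')
```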


Best tools to measure Pulsar

Tool — Prometheus + Grafana

  • What it measures for Pulsar: Broker, bookie, and proxy metrics, JVM, latency and lag.
  • Best-fit environment: Kubernetes and on-prem clusters.
  • Setup outline:
  • Export Pulsar metrics via Prometheus exporter.
  • Configure scrape jobs for brokers and bookies.
  • Create Grafana dashboards for key metrics.
  • Set alerting rules in Prometheus or Alertmanager.
  • Strengths:
  • Flexible query language and dashboarding.
  • Widely adopted in cloud-native stacks.
  • Limitations:
  • Requires maintenance and scaling for high-cardinality metrics.
  • Long-term storage needs extra components.

Tool — OpenTelemetry

  • What it measures for Pulsar: Traces for publish and consume operations.
  • Best-fit environment: Distributed applications needing tracing.
  • Setup outline:
  • Instrument producers and consumers.
  • Capture spans for publish/receive operations.
  • Export to tracing backend.
  • Strengths:
  • Standardized tracing across services.
  • Useful for root-cause analysis.
  • Limitations:
  • Sampling decisions affect visibility.
  • Client instrumentation needed.
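A minimal sketch of that instrumentation with the OpenTelemetry Python API is below. Exporter and TracerProvider setup are omitted, and the span and attribute names are illustrative rather than an official convention.

```python
# Hedged sketch: wrap publish and consume in OpenTelemetry spans so
# end-to-end latency can be broken down per hop.
import pulsar
from opentelemetry import trace

tracer = trace.get_tracer('pulsar-example')
client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('persistent://public/default/orders')

with tracer.start_as_current_span('pulsar.publish') as span:
    span.set_attribute('messaging.destination', 'orders')
    producer.send(b'order-created:1234')

consumer = client.subscribe('persistent://public/default/orders', 'traced-sub')
with tracer.start_as_current_span('pulsar.consume') as span:
    msg = consumer.receive()
    span.set_attribute('messaging.message_id', str(msg.message_id()))
    consumer.acknowledge(msg)

client.close()
```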

Tool — Pulsar Manager / Cluster Manager UI

  • What it measures for Pulsar: Cluster state, topic and subscription overview.
  • Best-fit environment: Administrative operations.
  • Setup outline:
  • Deploy manager alongside cluster.
  • Configure access control.
  • Use UI for topic management and metrics.
  • Strengths:
  • Operational visibility into namespaces and topics.
  • Easier for non-SRE users.
  • Limitations:
  • Not a full monitoring solution.
  • Scalability varies with cluster size.

Tool — Fluentd / Log Aggregator

  • What it measures for Pulsar: Client and broker logs for errors and audit.
  • Best-fit environment: Centralized logging.
  • Setup outline:
  • Ship logs from brokers and bookies to central sink.
  • Parse and tag pulsar logs.
  • Create alerts for error patterns.
  • Strengths:
  • Rich context for debugging.
  • Searchable history.
  • Limitations:
  • Requires log volume storage and parsing rules.
  • Noise without filtering.

Tool — Stream Processing Metrics (Flink/Spark)

  • What it measures for Pulsar: End-to-end processing latency and throughput in stream jobs.
  • Best-fit environment: Pipelines using stream processing engines.
  • Setup outline:
  • Instrument job metrics and export to monitoring.
  • Correlate with Pulsar metrics.
  • Strengths:
  • Shows pipeline health tied to Pulsar.
  • Useful for backpressure detection.
  • Limitations:
  • Requires integration work.
  • Job-level metrics not equal to broker health.

Recommended dashboards & alerts for Pulsar

Executive dashboard:

  • Panels: Cluster health summary, total throughput, cross-region replication status, storage used, incident count.
  • Why: High-level view for stakeholders and execs.

On-call dashboard:

  • Panels: Broker CPU and GC, bookie disk usage, under-replicated ledgers, replication lag, publish error rate, consumer lag per critical namespace.
  • Why: Rapid triage and remediation for SREs.

Debug dashboard:

  • Panels: Per-topic throughput, per-partition latency, consumer ack rates, connection errors, recent broker logs, per-node JVM metrics.
  • Why: Deep troubleshooting for engineers during incidents.

Alerting guidance:

  • Page (urgent): Sustained under-replicated ledgers, replication backlog exceeding threshold for >5 minutes, majority broker nodes down.
  • Ticket (non-urgent): Single broker restart, brief spike in publish errors <5 minutes, disk usage approaching threshold.
  • Burn-rate guidance: If SLO error budget burned at >50% in 6 hours, escalate and run remediation playbook.
  • Noise reduction: Deduplicate alerts using grouping keys, suppress known noisy windows during rolling upgrades, apply rate-limited notifications.
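A minimal sketch of the burn-rate guidance, assuming a 30-day SLO period and roughly uniform traffic; the error and request counts would normally come from your metrics system, and all numbers below are illustrative.

```python
# Estimate what fraction of a 30-day error budget was consumed in the last
# 6 hours and escalate if it exceeds 50%. All inputs are illustrative.
slo_target = 0.999                       # e.g. 99.9% successful publishes
error_budget_ratio = 1 - slo_target      # allowed failure ratio over the SLO period

window_errors = 1_200                    # failed publishes in the last 6 hours
window_total = 600_000                   # total publishes in the last 6 hours
window_error_ratio = window_errors / window_total

# Assume traffic is roughly uniform, so the 6-hour window is 6/720 of the period.
period_hours, window_hours = 30 * 24, 6
budget_burned = (window_error_ratio / error_budget_ratio) * (window_hours / period_hours)

print(f'error budget consumed in window: {budget_burned:.1%}')
if budget_burned > 0.50:
    print('escalate and run the remediation playbook')
```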

Implementation Guide (Step-by-step)

1) Prerequisites:

  • Capacity planning for expected throughput and retention.
  • Kubernetes cluster or VM pool sized for brokers and bookies.
  • Storage planning for bookies with IOPS and disk durability considerations.
  • Security policies: TLS, authentication, and authorization plans.
  • Backup and metadata snapshot strategies.

2) Instrumentation plan:

  • Export core metrics, implement tracing for producers and consumers, and capture logs and audit events.
  • Define SLIs and export them as metrics.

3) Data collection:

  • Set up Prometheus scrapers, the logging pipeline, and tracing collectors.
  • Ensure metrics retention is long enough for SLO calculation.

4) SLO design:

  • Define SLOs for publish latency, end-to-end latency, and availability.
  • Set error budgets and remediation thresholds.

5) Dashboards:

  • Build executive, on-call, and debug dashboards.
  • Include per-namespace and per-topic drilldowns.

6) Alerts & routing:

  • Define alert thresholds and who to page.
  • Implement dedupe and silence alerts during maintenance.

7) Runbooks & automation:

  • Create runbooks for common failures such as bookie replacement, re-replication, and partition rebalance.
  • Automate scaling and repairs where possible.

8) Validation (load/chaos/game days):

  • Run load tests with realistic producer patterns.
  • Perform chaos tests for broker and bookie failures.
  • Run DR tests for geo-replication failover.
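For the load-test part of step 8, a hedged sketch with the Python client: it publishes synthetic messages asynchronously and counts failed sends. The rate, topic name, and payloads are placeholders; a real test should mirror production message sizes and keys.

```python
# Hedged load-test sketch: async publish with a send callback that counts failures.
import pulsar, time

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/loadtest',
    batching_enabled=True,
)

failures = 0
def on_sent(res, msg_id):
    global failures
    if res != pulsar.Result.Ok:
        failures += 1

start = time.time()
for i in range(100_000):
    producer.send_async(f'synthetic-{i}'.encode(), on_sent)

producer.flush()
print(f'published 100k msgs in {time.time() - start:.1f}s, failures={failures}')
client.close()
```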

9) Continuous improvement:

  • Weekly reviews of alerts and incidents.
  • Quarterly review of retention and cost.
  • Update SLOs based on observed behavior.

Pre-production checklist:

  • Capacity and retention validated with load tests.
  • Security policies tested with sample clients.
  • Observability stack deployed and dashboards ready.
  • Runbooks created and reviewed.

Production readiness checklist:

  • Alerting and escalation paths tested.
  • Automatic recovery playbooks implemented.
  • Backup and restore validated.
  • On-call rotation and runbook access confirmed.

Incident checklist specific to Pulsar:

  • Confirm scope: affected namespaces/topics.
  • Check under-replicated ledgers and replication backlog.
  • Verify broker and bookie node health.
  • If replication lag high, throttle producers and scale replication.
  • Notify stakeholders and run remediation playbook.

Use Cases of Pulsar


1) Telemetry Ingestion – Context: High-volume device telemetry. – Problem: Durable ingestion with replay for analytics. – Why Pulsar helps: High throughput, retention, and fan-out to multiple consumers. – What to measure: Ingest rate, publish latency, backlog. – Typical tools: Prometheus, Flink.

2) Event Sourcing for Microservices – Context: Business events stored as source of truth. – Problem: Need ordering and replay for rebuilds. – Why Pulsar helps: Partitioned topics and durable storage. – What to measure: End-to-end latency, retention usage. – Typical tools: Schema registry, stream processors.

3) Stream Processing Pipeline – Context: ETL from events to data lake. – Problem: Fault-tolerant pipeline with exactly-once semantics. – Why Pulsar helps: Durable source with connectors and integration. – What to measure: Processing latency, consumer failures. – Typical tools: Flink, connectors.

4) Job Queue / Worker Pool – Context: Background task processing. – Problem: Reliable distribution and retry semantics. – Why Pulsar helps: Shared subscriptions, ack policies. – What to measure: Task completion rates, redelivery counts. – Typical tools: Worker frameworks, monitoring.

5) Geo-Replicated Messaging – Context: Global application with regional failover. – Problem: Data locality and disaster recovery. – Why Pulsar helps: Built-in replication policies. – What to measure: Replication lag, cross-region latency. – Typical tools: SIEM, cross-region monitors.

6) Audit Trail and Compliance – Context: Regulatory logging and immutable audit. – Problem: Retention and access controls. – Why Pulsar helps: Durable retention and namespaces for isolation. – What to measure: Retention enforcement, schema rejections. – Typical tools: Archival offload to object storage.

7) Notification and Fan-out – Context: Multi-channel notifications from events. – Problem: Delivering same event to many consumers. – Why Pulsar helps: Efficient fan-out with multiple subscriptions. – What to measure: Delivery success rate, subscriber latency. – Typical tools: Functions for transformation.

8) ML Feature Streaming – Context: Real-time feature updates for models. – Problem: Low-latency, consistent updates across replicas. – Why Pulsar helps: Streaming guarantees and replay for feature regeneration. – What to measure: End-to-end latency, schema compatibility. – Typical tools: Feature stores, stream processors.

9) CI/CD Event Bus – Context: Build and deploy events across pipelines. – Problem: Reliable orchestration events and tracing. – Why Pulsar helps: Durable and multi-tenant topics for pipelines. – What to measure: Event delivery rates, processing failures. – Typical tools: CI/CD systems and event handlers.

10) Financial Trading Events – Context: High-frequency market events. – Problem: Ordering, durability, and replication for compliance. – Why Pulsar helps: Partitioning for ordering and geo-replication. – What to measure: Publish latency, message loss incidents. – Typical tools: Low-latency hardware and tuned bookies.


Scenario Examples (Realistic, End-to-End)

Scenario #1 — Kubernetes-based Event Pipeline

Context: E-commerce platform deployed on Kubernetes processes order events.
Goal: Build a scalable event pipeline for order ingestion, fulfillment, and analytics.
Why Pulsar matters here: Provides durable ingestion, replay for analytics, and namespace isolation per environment.
Architecture / workflow: Producers (order API pods) -> Pulsar brokers (K8s statefulset) -> Bookies (statefulset with PVCs) -> Consumers: fulfillment workers and analytics Flink job.
Step-by-step implementation:

  1. Provision K8s cluster with storage class for bookies.
  2. Deploy Pulsar Operator and create cluster CR.
  3. Configure namespaces and auth policies for services.
  4. Create partitioned topic orders with partition key customer_id.
  5. Deploy consumers and Flink job to read from orders topic.
  6. Set up Prometheus scraping for metrics and Grafana dashboards.

What to measure: Publish latency, consumer lag, broker restarts, disk usage.
Tools to use and why: Kubernetes Operator for lifecycle; Prometheus/Grafana for metrics; Flink for processing.
Common pitfalls: Insufficient PVC IOPS, leading to disk write latency.
Validation: Load test with synthetic order traffic and monitor SLOs.
Outcome: Scalable ingestion with replay ability for analytics.
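A small sketch for step 4 of this scenario: publishing with customer_id as the partition key so all events for a customer land on the same partition and keep per-customer ordering. The broker URL (an assumed in-cluster service name), tenant/namespace, and payload shape are illustrative.

```python
# Route order events by customer_id to preserve per-customer ordering.
import json
import pulsar

client = pulsar.Client('pulsar://pulsar-broker.pulsar.svc:6650')  # assumed in-cluster address
producer = client.create_producer('persistent://shop/orders/orders')

event = {'order_id': 'o-1234', 'customer_id': 'c-42', 'status': 'created'}
producer.send(
    json.dumps(event).encode('utf-8'),
    partition_key=event['customer_id'],   # stable partition per customer
)
client.close()
```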

Scenario #2 — Serverless / Managed-PaaS Integration

Context: SaaS product using managed functions for event-driven compute.
Goal: Trigger serverless functions from events and guarantee at-least-once invocation.
Why Pulsar matters here: Durable message delivery with function integrations and retries.
Architecture / workflow: Producers -> Pulsar brokers (managed or hosted) -> Functions platform (subscriber) -> Downstream services.
Step-by-step implementation:

  1. Use managed Pulsar offering or managed connectors to surface topics.
  2. Configure function triggers with appropriate subscription type.
  3. Enable auth tokens and TLS for secure connectivity.
  4. Monitor invocation failures and set a dead-letter topic for failed messages.

What to measure: Invocation latency, function failure rate, DLQ size.
Tools to use and why: Managed Pulsar, Functions runtime, log aggregator.
Common pitfalls: Cold starts in serverless combined with a message backlog can cause timeouts.
Validation: Simulate burst traffic and verify retries land in the DLQ.
Outcome: Reliable serverless triggers with durable retry handling.

Scenario #3 — Incident Response and Postmortem

Context: Production incident where bookie failures caused delayed processing.
Goal: Triage and resolve replication backlog, prevent recurrence.
Why Pulsar matters here: Storage failures impact data availability and SLOs.
Architecture / workflow: Normal cluster with replication across two regions; bookie failure in primary region.
Step-by-step implementation:

  1. Identify under-replicated ledgers from metrics.
  2. Replace failed bookies and initiate re-replication.
  3. Throttle producers if replication backlog threatens storage.
  4. Create an incident ticket and execute the runbook for bookie replacement.

What to measure: Under-replicated ledgers, replication backlog, publish errors.
Tools to use and why: Monitoring dashboards, logs, runbooks.
Common pitfalls: Missing automated replacement causes manual delays.
Validation: Postmortem with timeline, root cause, and mitigation plan.
Outcome: Recovered replication, SLO review, automated remediation added.

Scenario #4 — Cost vs Performance Trade-off

Context: High-volume event retention causing high storage cost.
Goal: Reduce storage cost while preserving replay for 30-day window.
Why Pulsar matters here: Retention and offload policies directly impact cost.
Architecture / workflow: Pulsar cluster with offload to object storage.
Step-by-step implementation:

  1. Measure retention and growth trends.
  2. Implement ledger offload to cheaper object storage after 7 days.
  3. Use compaction for key-based topics to reduce size.
  4. Monitor read latency for offloaded segments and adjust SLOs.

What to measure: Retained bytes, offload success rate, read latency for offloaded data.
Tools to use and why: Storage analytics, object store metrics, Pulsar offload monitoring.
Common pitfalls: Offload retrieval latency causing consumer slowdowns.
Validation: Cost analysis and read performance test.
Outcome: Lower cost with an acceptable read-latency trade-off.
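A back-of-envelope sketch of the trade-off in this scenario, comparing 30 days kept on bookie disks against 7 days hot plus offload to object storage. All ingest rates, replica counts, and prices below are illustrative assumptions; plug in your own numbers.

```python
# Rough cost comparison: all-hot retention vs tiered retention with offload.
ingest_mb_per_s = 50                      # assumed average ingest rate
replicas = 3                              # assumed bookie write replicas
gb_per_day = ingest_mb_per_s * 86_400 / 1024

hot_price_gb_month = 0.10                 # assumed block-storage price
cold_price_gb_month = 0.02                # assumed object-storage price

all_hot = 30 * gb_per_day * replicas * hot_price_gb_month
tiered = (7 * gb_per_day * replicas * hot_price_gb_month
          + 23 * gb_per_day * cold_price_gb_month)   # offloaded copy, single-region

print(f'30 days on bookies  : ~${all_hot:,.0f}/month')
print(f'7 days hot + offload: ~${tiered:,.0f}/month')
```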

Common Mistakes, Anti-patterns, and Troubleshooting

Common mistakes, with symptom -> root cause -> fix:

  1. Symptom: High publish latency -> Root cause: Broker CPU or GC -> Fix: Tune JVM, increase brokers.
  2. Symptom: Consumer duplicate processing -> Root cause: At-least-once semantics and missing idempotency -> Fix: Implement idempotent consumers or dedupe.
  3. Symptom: Under-replicated ledgers -> Root cause: Bookie failure -> Fix: Replace bookie and run re-replication.
  4. Symptom: Unexpected data deletion -> Root cause: Misconfigured retention or TTL -> Fix: Review and correct retention policies.
  5. Symptom: Hot partition with high latency -> Root cause: High-cardinality partition keys not used correctly -> Fix: Repartition or change partition key.
  6. Symptom: Produce errors during upgrades -> Root cause: Rolling upgrade without client coordination -> Fix: Apply client retry/backoff and coordinate upgrade windows.
  7. Symptom: Metric gaps in monitoring -> Root cause: Exporter misconfiguration -> Fix: Ensure metric scrape targets and endpoints are correct.
  8. Symptom: Excessive storage cost -> Root cause: Long retention and no offload -> Fix: Implement offload and compaction.
  9. Symptom: Connection auth failures -> Root cause: Token expiry or incorrect certs -> Fix: Rotate tokens and update clients.
  10. Symptom: Sudden broker restarts -> Root cause: OOM/killed processes -> Fix: Increase memory and tune heap.
  11. Symptom: High consumer lag -> Root cause: Slow consumer processing or backpressure -> Fix: Scale consumers and optimize processing.
  12. Symptom: Geo-replication backlog -> Root cause: Network partition -> Fix: Restore network, increase replication throughput.
  13. Symptom: Topic creation abuse -> Root cause: Auto-create enabled for dev clients -> Fix: Disable auto-create and enforce namespace quotas.
  14. Symptom: Schema incompatibility errors -> Root cause: Uncoordinated schema changes -> Fix: Enforce schema evolution rules.
  15. Symptom: Noisy alerts -> Root cause: Low thresholds and no dedupe -> Fix: Tune thresholds, apply grouping.
  16. Symptom: Playback failures -> Root cause: Missing ledgers after retention expiry -> Fix: Adjust retention or offload strategy.
  17. Symptom: Slow ledger compaction -> Root cause: Compaction running on overloaded nodes -> Fix: Schedule compaction during low traffic.
  18. Symptom: Incomplete backups -> Root cause: Offload failures -> Fix: Monitor offload success and retry on failure.
  19. Symptom: Debugging blind spots -> Root cause: No tracing or correlation IDs -> Fix: Add OpenTelemetry tracing and message IDs.
  20. Symptom: Security exposure -> Root cause: Open ports and weak auth -> Fix: Enforce TLS and RBAC.

Observability pitfalls:

  • Missing tracing causes long MTTR -> add tracing.
  • High-cardinality metrics overload Prometheus -> reduce cardinality or use remote write.
  • No per-topic metrics hides hotspots -> enable per-topic telemetry.
  • Ignoring log parsing loses error context -> centralize and parse logs.
  • Not measuring backlog hides replication issues -> monitor backlog metrics.

Best Practices & Operating Model

Ownership and on-call:

  • Define cluster ownership: platform team owns infrastructure; app teams own topics and consumers.
  • Ensure on-call rotations include platform engineers with runbook access.

Runbooks vs playbooks:

  • Runbooks: Step-by-step operational commands for routine failures.
  • Playbooks: High-level decision trees for escalations and cross-team coordination.

Safe deployments:

  • Canary: Deploy brokers or config changes to a subset of brokers first.
  • Rollback: Keep automated rollback for deployments failing health checks.

Toil reduction and automation:

  • Automate bookie replacement, scaling, and offload verification.
  • Implement automatic alerts for under-replication and replication lag.

Security basics:

  • Enforce TLS for client and internode traffic.
  • Use token-based authentication and RBAC at namespace/topic level.
  • Rotate credentials and certificates on a schedule.

Weekly/monthly routines:

  • Weekly: Check health dashboards, under-replicated ledgers, storage growth.
  • Monthly: Review retention policies and access logs; run disaster recovery drills.

What to review in postmortems related to Pulsar:

  • Timeline of broker and bookie events.
  • Metrics around publish and consume latencies.
  • Root cause in storage or network.
  • Actions taken and automation added.
  • Impact to SLOs and customers.

Tooling & Integration Map for Pulsar

| ID | Category | What it does | Key integrations | Notes |
|----|----------|--------------|-------------------|-------|
| I1 | Monitoring | Collects metrics and alerts | Prometheus, Grafana | Use exporters on brokers |
| I2 | Logging | Aggregates logs | Fluentd, ELK | Parse Pulsar logs for errors |
| I3 | Tracing | Traces publish and consume | OpenTelemetry | Instrument producers and consumers |
| I4 | Stream Processing | Stateful processing of streams | Flink, Spark | Use Pulsar connectors |
| I5 | Functions | Lightweight event compute | Pulsar Functions | Good for simple transforms |
| I6 | CI/CD | Integration tests and deployment | Helm, Operators | Use the operator for Kubernetes lifecycle |
| I7 | Security | Auth and RBAC | TLS, OAuth | Enforce per-namespace policies |
| I8 | Backup | Offload to object storage | S3, GCS | Validate restores regularly |
| I9 | Schema Management | Enforce message schemas | Built-in registry | Version control for schemas |
| I10 | Managed Services | Hosted Pulsar offerings | Cloud IAM | Vendor SLAs vary |


Frequently Asked Questions (FAQs)

What is Pulsar used for?

Pulsar is used for durable messaging, streaming, ingestion, and multi-consumer fan-out with replay and geo-replication.

Is Pulsar a replacement for Kafka?

It can be for many streaming use cases; differences include architecture and operational model.

Does Pulsar guarantee exactly-once delivery?

Not inherently; it provides at-least-once semantics. Exactly-once often requires upstream and downstream idempotence or processing frameworks.
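One common way to approximate effectively-once on the consumer side is to deduplicate by message ID or a business key. A minimal sketch below; the handle_payment handler is hypothetical, and the in-memory set stands in for a persistent store you would use in practice.

```python
# Idempotent-consumer sketch: skip redelivered messages already processed.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe('persistent://public/default/payments', 'billing-sub')

processed = set()
for _ in range(100):
    msg = consumer.receive()
    mid = str(msg.message_id())
    if mid not in processed:
        handle_payment(msg.data())    # hypothetical idempotency-sensitive handler
        processed.add(mid)
    consumer.acknowledge(msg)         # ack duplicates too, so they stop being redelivered

client.close()
```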

How does Pulsar store messages?

Messages are persisted to ledger segments in bookies; metadata tracks ledger locations.

Can Pulsar run on Kubernetes?

Yes; operators and Helm charts support Kubernetes deployments.

Is Pulsar multi-tenant?

Yes; namespaces and tenant constructs provide multi-tenancy.

What are common causes of message loss?

Misconfigured retention, under-replicated ledgers, and improper offload setups.

How to handle schema evolution?

Use Pulsar’s schema registry and enforce compatibility rules.
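A minimal sketch of schema-aware producers with the Python client's built-in schema support; the record fields and topic name are illustrative, and compatibility rules are enforced by the broker when clients attach.

```python
# JSON-schema producer sketch: the schema is registered and checked on connect.
import pulsar
from pulsar.schema import Record, String, Integer, JsonSchema

class OrderEvent(Record):
    order_id = String()
    customer_id = String()
    amount_cents = Integer()

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/orders',
    schema=JsonSchema(OrderEvent),
)
producer.send(OrderEvent(order_id='o-1', customer_id='c-42', amount_cents=1999))
client.close()
```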

How to secure Pulsar?

Use TLS, token-based auth, and RBAC for namespace/topic access.

How to scale Pulsar for high throughput?

Increase brokers and partitions, tune bookie storage and network, and partition topics appropriately.
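On the producer side, batching is one of the main throughput levers. A minimal sketch with the Python client; the batch size and delay values are illustrative and trade a little latency for throughput.

```python
# Producer batching sketch: many small messages share one broker/bookie write.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/telemetry',
    batching_enabled=True,
    batching_max_messages=1000,          # flush after this many messages...
    batching_max_publish_delay_ms=10,    # ...or after 10 ms, whichever comes first
)
for i in range(10_000):
    producer.send_async(f'sample-{i}'.encode(), lambda res, mid: None)
producer.flush()
client.close()
```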

Can Pulsar be managed as a service?

Yes; managed Pulsar offerings exist but SLA and features vary by vendor.

How to monitor Pulsar health?

Collect broker, bookie, and replication metrics; monitor under-replicated ledgers and consumer lag.

What is a bookie?

A storage node in the BookKeeper model that persists message ledgers.

How to prevent topic sprawl?

Disable auto-creation or enforce quotas per namespace.

What happens during metadata store failure?

Topics may become inaccessible; restore from metadata backups.

How to test Pulsar in staging?

Run load tests replicating production traffic patterns and perform chaos tests on brokers/bookies.

How to keep costs under control?

Use offload, compaction, reasonable retention, and monitor stored bytes.

What are best SLOs for Pulsar?

Start with publish latency SLOs and consumer lag targets; tune per-application needs.


Conclusion

Apache Pulsar is a robust, cloud-native streaming and messaging solution that fits modern event-driven architectures when durability, replay, and multi-region replication matter. Operating Pulsar well requires investment in observability, automation, and SRE practices to manage storage and replication complexity.

Next 7 days plan:

  • Day 1: Run capacity and retention estimation for target workloads.
  • Day 2: Deploy a dev Pulsar cluster and enable Prometheus metrics.
  • Day 3: Create sample topics and instrument a producer and consumer with tracing.
  • Day 4: Implement SLOs and dashboards for publish latency and consumer lag.
  • Day 5: Run a load test and observe bottlenecks.
  • Day 6: Create runbooks for core failures and test a failover.
  • Day 7: Review cost and retention policy; draft production readiness checklist.

Appendix — Pulsar Keyword Cluster (SEO)

  • Primary keywords
  • pulsar messaging
  • apache pulsar
  • pulsar streaming
  • pulsar architecture
  • pulsar tutorial

  • Secondary keywords

  • pulsar vs kafka
  • pulsar broker
  • pulsar bookie
  • pulsar topics
  • pulsar geo-replication

  • Long-tail questions

  • how to deploy pulsar on kubernetes
  • pulsar best practices for production
  • how does pulsar store messages
  • pulsar monitoring and observability
  • pulsar schema registry usage

  • Related terminology

  • bookkeeper ledger
  • partitioned topic
  • namespace quotas
  • pulsar functions
  • retention policy
  • replication backlog
  • consumer lag
  • publish latency
  • under-replicated ledger
  • offload to object storage
  • message id tracking
  • schema evolution
  • token authentication
  • TLS encryption
  • Prometheus Grafana
  • OpenTelemetry tracing
  • Flink connector
  • serverless triggers
  • multi-tenant isolation
  • JVM tuning
  • ledger compaction
  • runbook automation
  • canary deployments
  • chaos testing
  • cost optimization
  • topic partitioning
  • hot partition mitigation
  • DLQ dead-letter topic
  • message deduplication
  • producer batching
  • subscription types
  • exclusive subscription
  • shared subscription
  • failover subscription
  • schema validation
  • data replay
  • audit trail
  • backup and restore
  • storage offload
  • message TTL
  • metadata store
  • operator lifecycle
  • managed pulsar service
  • event sourcing with pulsar
  • real-time analytics
  • ingest pipeline
  • notifications and fanout
  • feature streaming
  • CI CD event bus