Quick Definition
Apache Pulsar is a distributed, cloud-native pub/sub messaging and streaming platform offering durable topics with partitioning and geo-replication. Analogy: Pulsar is like a global postal system with guaranteed delivery, routing, and regional offices. Formally: a multi-tenant, topic-centric messaging system with decoupled storage and serving layers.
What is Pulsar?
What it is:
- A distributed messaging and streaming platform supporting pub/sub, streaming, and queue semantics with persistent storage and at-least-once delivery guarantees by default.
- Multi-tenancy, topic-level isolation, and built-in geo-replication are core design features.
What it is NOT:
- Not just a lightweight broker; it includes a distributed storage layer (Apache BookKeeper) for persistence.
- Not purely an event bus; it is a full streaming platform with replay, partitioning, and retention controls.
Key properties and constraints:
- Decoupled serving and storage: brokers handle client connections and routing while a separate storage layer (BookKeeper) persists messages.
- Partitioned topics for parallelism and ordering per partition.
- Native support for geo-replication and tenant isolation.
- Strong consistency within partitions for ordered delivery with configurable delivery semantics.
- Operational complexity: requires orchestration and storage management for production-grade deployments.
- Latency and throughput trade-offs depend on deployment topology and storage configuration.
Where it fits in modern cloud/SRE workflows:
- Ingest layer for event-driven architectures and data pipelines.
- Buffering and durable message transport between microservices, analytics, and downstream sinks.
- Replacement for legacy message queues when persistence, replay, or multi-region replication is required.
- Integrates with Kubernetes for cloud-native deployments and with serverless functions for event-driven compute.
Text-only diagram description (visualize):
- Imagine three layers stacked vertically: a top layer of clients (producers and consumers) connects to a middle layer of stateless brokers. The brokers route messages and forward persistence operations to a bottom layer of bookie nodes that store segments. A metadata layer holds topic and ledger metadata, and a manager coordinates replication and lifecycle. Geo-replication copies ledgers across regions asynchronously. Monitoring and operator tooling sit alongside, interacting via APIs and the control plane.
Pulsar in one sentence
A cloud-native, multi-tenant streaming and messaging platform that decouples serving and storage to provide durable, replayable, and geo-replicated event delivery.
Pulsar vs related terms
| ID | Term | How it differs from Pulsar | Common confusion |
|---|---|---|---|
| T1 | Kafka | Broker-focused cluster with log storage inside brokers | Both are streaming systems |
| T2 | RabbitMQ | Traditional broker with rich routing; not built for long-retention streams | Often used for short-lived messages |
| T3 | BookKeeper | Storage layer concept used by Pulsar | BookKeeper is storage not full messaging |
| T4 | Event Mesh | Architectural pattern across networks | Pulsar can implement this but is a product |
| T5 | Stream Processing | Compute over streams | Pulsar provides streams not necessarily processing |
| T6 | Message Queue | Simple queue semantics | Pulsar supports more features and persistence |
| T7 | Service Bus | Enterprise integration suite | Pulsar focuses on events not full ESB features |
| T8 | Kinesis | Managed AWS streaming service | Kinesis is cloud managed; Pulsar can be self-hosted |
| T9 | Managed Pulsar | Cloud-hosted Pulsar offering | Implementation and SLA vary by vendor |
| T10 | Schema Registry | Type management for events | Pulsar includes schema support natively |
Row Details (only if needed)
- None.
Why does Pulsar matter?
Business impact:
- Revenue: Durable, high-throughput messaging reduces lost events and supports real-time revenue features like personalization or trading.
- Trust: Geo-replication and persistence lower data loss risk and contractual SLA exposure.
- Risk: Complex operations can increase vendor or ops risk if not automated.
Engineering impact:
- Incident reduction: Proper isolation and retention can reduce production data loss incidents.
- Velocity: Developers build event-driven features faster when replay and durability are available.
- Complexity: operating Pulsar introduces storage management and cross-region consistency tasks.
SRE framing:
- SLIs/SLOs: Throughput, publish latency, consume lag, write durability, and replication success rate are primary SLIs.
- Error budgets: Use delivery failure rates and replication lag to define burn rates.
- Toil/on-call: Operational automation around broker scaling, ledger compaction, and bookie replacements reduces toil.
Realistic “what breaks in production” examples:
- Broker crash causing increased consumer reconnects and partition leadership churn.
- Bookie disk failure causing ledger under-replicated states and slow recovery.
- Misconfigured retention leading to unexpected data deletion and failed replays.
- Network partition between regions causing geo-replication backlog and increased latency.
- Tenant or topic hot-spot causing broker CPU exhaustion and downstream pipeline slowdown.
Where is Pulsar used?
| ID | Layer/Area | How Pulsar appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge/Ingress | Ingest buffer for telemetry and events | Ingest rate and latency | See details below: L1 |
| L2 | Network/Transport | Message routing and fanout | Broker CPU and network | Prometheus Grafana |
| L3 | Service/Application | Event bus between microservices | Consumer lag and errors | OpenTelemetry |
| L4 | Data/Analytics | Streaming source for pipelines | Throughput and retention | Flink Spark |
| L5 | Cloud/Kubernetes | Deployed as stateful services | Pod restarts and PVC usage | Helm Operators |
| L6 | Serverless/PaaS | Trigger for functions | Invocation latency and retries | Functions platform |
| L7 | CI/CD & Ops | Integration tests and pipelines | Test publish/consume success | CI tools |
| L8 | Observability/Security | Audit events and telemetry | Audit logs and access denials | SIEM Prometheus |
Row Details (only if needed)
- L1: Edge gateways publish bursts from devices, use batching and TLS termination.
- L6: Serverless platforms subscribe to topics to trigger short-lived compute.
When should you use Pulsar?
When it’s necessary:
- You need durable, replayable streams with geo-replication across regions.
- Multi-tenancy and strict isolation between teams or customers are required.
- High throughput with long retention windows and large-scale partitioning is needed.
When it’s optional:
- Small-scale pub/sub inside a single region with simple queue semantics.
- Very low-latency in-memory pub/sub where persistence is not required.
When NOT to use / overuse it:
- For simple point-to-point RPC; lightweight message brokers may be simpler.
- When organizational maturity cannot support operating storage and replication at scale.
- For tiny workloads where managed cloud queues are significantly cheaper.
Decision checklist:
- If you need durability and replay and expect >100k msgs/s -> use Pulsar.
- If you need simple ephemeral messaging with minimal ops -> consider SaaS queue.
- If you need serverless triggers and prefer managed operations -> evaluate managed Pulsar offerings.
Maturity ladder:
- Beginner: Single-region, few topics, limited partitions, operator-managed.
- Intermediate: Multiple namespaces, automated Helm/Operator deployments, basic SLOs.
- Advanced: Multi-region geo-replication, dynamic partitioning, automated scaling, full SRE on-call.
How does Pulsar work?
Components and workflow:
- Producer: Sends messages to a topic via broker endpoint.
- Broker: Stateless routing layer that handles client connections and enforces topic policies.
- BookKeeper (bookies): Storage nodes that persist ledgers (message segments) across replicas.
- Metadata store: Stores topic metadata, ledger locations, and cluster configuration.
- Coordination service: ZooKeeper historically handled coordination; newer releases support pluggable metadata backends, so specifics vary by version.
- Consumers: Pull or receive messages with acknowledgment semantics; can replay from offsets.
Data flow and lifecycle:
- Producer sends messages to broker.
- Broker validates and routes request, assigns ledger/segment.
- Broker writes message to bookies; bookies persist data across replicas.
- Once the write quorum acknowledges, the broker confirms the publish to the producer.
- Consumers fetch messages; after processing they ACK, enabling retention/compaction policies to free storage.
- Data older than the retention window, or superseded keys under compaction, is removed; ledgers are offloaded or deleted as configured. A minimal client-side sketch of this lifecycle follows.
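The sketch below walks that lifecycle from the client's perspective. It is a minimal example, assuming a locally reachable broker and the pulsar-client Python package; the topic and subscription names are illustrative.

```python
# Minimal publish/ack/consume lifecycle sketch, assuming a broker at
# pulsar://localhost:6650; topic and subscription names are illustrative.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Producer: send() blocks until the broker has confirmed the bookie write
# quorum, mirroring the flow described above.
producer = client.create_producer('persistent://public/default/orders')
msg_id = producer.send(b'order-created')
print('acknowledged publish:', msg_id)

# Consumer: receive, process, then ACK so retention policies can free storage.
consumer = client.subscribe('persistent://public/default/orders', 'analytics-sub')
msg = consumer.receive(timeout_millis=10000)
try:
    print('got:', msg.data())
    consumer.acknowledge(msg)           # processed successfully
except Exception:
    consumer.negative_acknowledge(msg)  # redelivered later (at-least-once)

client.close()
```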
Edge cases and failure modes:
- Under-replicated ledgers when bookies fail; need quick re-replication.
- Broker restarts requiring client reconnection and potential short consumer stalls.
- Hot partitions causing skewed throughput across consumers.
- Geo-replication backlog growth during network partitions.
Typical architecture patterns for Pulsar
- Ingest and Fan-out: Producers publish telemetry; multiple consumer groups process in parallel. Use when many downstream consumers need the same events.
- Queue/Work Distribution: Use exclusive or shared subscriptions for worker queues with acknowledgement semantics. Use when distributing tasks across workers (see the sketch after this list).
- Stream Processing with Connectors: Pulsar as source to stream processing engines (Flink) and sinks (object stores). Use for ETL and analytics.
- Event Sourcing: Store events for domain-driven designs with retention and replay. Use when auditability and replay are required.
- Geo-Replicated Global Bus: Multiple regions with replication policies for disaster recovery and low-latency reads. Use for global applications requiring redundancy.
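As a concrete example of the Queue/Work Distribution pattern, the sketch below uses a Shared subscription so the broker spreads messages across workers; it assumes the pulsar-client Python package, and the topic name and task handler are hypothetical.

```python
# Work-queue sketch: a Shared subscription distributes tasks across workers;
# each worker ACKs only what it finishes, so failures are redelivered.
import pulsar

def process(payload: bytes) -> None:
    """Hypothetical task handler; replace with real work."""
    print('processing', payload)

client = pulsar.Client('pulsar://localhost:6650')
worker = client.subscribe(
    'persistent://public/default/tasks',
    subscription_name='worker-pool',
    consumer_type=pulsar.ConsumerType.Shared,  # spread across all workers
)

while True:
    msg = worker.receive()
    try:
        process(msg.data())
        worker.acknowledge(msg)
    except Exception:
        worker.negative_acknowledge(msg)  # redelivered, possibly elsewhere
```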
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Broker crash | Clients reconnecting frequently | Broker OOM or CPU spike | Scale brokers and tune JVM | Broker restarts metric |
| F2 | Bookie disk full | Writes failing or slow | Retention misconfig or disk leak | Add storage or prune retention | Disk usage and write errors |
| F3 | Under-replicated ledgers | Replication count below target | Bookie failure | Trigger re-replication and replace bookie | Underreplicated ledger count |
| F4 | Geo-replication lag | Remote consumers see old data | Network partition or slow link | Increase replication threads or bandwidth | Replication backlog |
| F5 | Topic hot-spot | High latency on a partition | Uneven partitioning | Repartition topic or add consumers | Per-partition throughput |
| F6 | Authentication failure | Clients rejected | Misconfigured auth or expired token | Rotate tokens and update configs | Auth failure logs |
| F7 | Metadata corruption | Topics inaccessible | Metadata store failure | Restore metadata backup; prevent writes | Metadata error events |
| F8 | Excessive retention cost | Unexpected storage bills | Long retention with high volume | Implement compaction and TTL | Retained bytes and cost metrics |
Row Details (only if needed)
- F3: Under-replicated ledgers can accumulate if recovery is slow. Steps: monitor underreplicated count, replace failed nodes, run rebalance.
- F4: Replication backlog after long partition can be huge. Steps: throttle producer, increase replication concurrency, apply temporary retention increases.
- F5: Repartitioning may require downtime or careful topic migration; use partitioned topics and measure consumer lag during migration.
Key Concepts, Keywords & Terminology for Pulsar
Glossary:
Acknowledge — Client confirms message processed — Enables backlog trimming and retention enforcement — Missing ACKs cause redelivery
Auto Topic Creation — Automatic topic provisioning — Simplifies dev experience — Can cause topic proliferation
Backlog — Unconsumed messages stored — Reflects retention and lag — Growing backlog increases storage cost
Broker — Stateless request handler — Routes messages to storage — Broker OOM affects availability
Bookie — Storage node in BookKeeper model — Persists ledgers — Disk failures lead to under-replication
Compaction — Key-based retention optimization — Reduces storage for keyed events — Misusing may lose history
Consumer — Reads messages from topic — Multiple modes exist — Misconfigured mode causes duplicates
Deduplication — Remove duplicate publishes — Ensures idempotence for producers — Requires message ID tracking
Delivery Semantics — At-least-once or effectively-once patterns — Defines client guarantees — Choosing wrong leads to duplicates
Geo-Replication — Replicates topics across regions — Enables DR and locality — Network issues cause backlog
Ledger — Storage segment of messages — Unit of persistence — Corrupt ledgers block topic reads
Namespace — Multi-tenant grouping of topics — Used for isolation and quotas — Misconfiguration exposes resources
Partitioned Topic — Topic split into partitions — Enables parallelism — Too many partitions cause overhead
Producer — Client sending messages — Controls batching and routing — Bad batch settings increase latency
Retention — How long messages are kept — Determines replayability — Long retention increases cost
Schema — Type constraints for messages — Provides compatibility guarantees — Schema evolution needs care
Subscription — Consumer group concept — Shared or exclusive modes — Wrong mode breaks delivery pattern
Subscription Position — Earliest, latest, or timestamp — Controls where consumers start — Misuse causes missing data
Sync Replication — Synchronous replication mode — Stronger durability — Slower writes
Async Replication — Asynchronous replication mode — Faster writes, eventual replication — Risk of data lag
BookKeeper Ensemble — Group of bookies storing ledger replicas — Ensemble size impacts durability — Too small risks data loss
Metadata Store — Stores topic and ledger metadata — Critical for cluster health — Corrupt metadata prevents operations
Znode — Metadata node entry (coordination store) — Used for locking and config — Zookeeper specifics may vary
TTL — Time to live for messages — Automatic deletion threshold — Wrong TTL deletes needed data
Message ID — Unique identifier per message — Needed for ack and replay — Mismanaging breaks dedupe
End-to-end Latency — Time from publish to consume — Key SLI — High variability breaks SLAs
Schema Registry — Manages schemas for topics — Ensures compatibility — Version misalignment breaks deserialization
Backpressure — Flow control under load — Protects brokers and consumers — Ignoring leads to OOM or stalls
Offload — Archive old ledgers to long-term storage — Saves disk space — Offload failures risk data loss
Retention Policy — Rules for storage retention — Controls cost vs replay — Inconsistent policy causes surprises
Consumer Lag — Difference between head and processed offset — Measures freshness — High lag indicates bottleneck
Partition Key — Used to route events to partition — Preserves ordering for a key — Using high-cardinality keys causes hotspots
Subscription Types — Exclusive, shared, failover — Determines consumer behavior — Wrong type alters semantics
Message TTL — Per-message expiry control — Auto-removes stale events — Misuse deletes live data
Load Manager — Balances topics across brokers — Prevents hotspots — Incorrect tuning causes imbalance
Pulsar Functions — Lightweight compute for event processing — Useful for simple transforms — Not a replacement for full stream engines
Connector — Prebuilt source/sink integration — Speeds pipeline building — Wrong connector causes duplication
Auth/Authorization — Security model for access control — Protects topics and namespaces — Weak config risks exposure
TLS — Encryption for client and internode traffic — Required for confidential data — Certificates must be rotated
Metrics Exporter — Exposes Pulsar metrics — Essential for SRE — Missing exporters blind ops
Schema Validation — Runtime message type checking — Prevents consumer errors — Strict validation blocks compatible producers
Consumer Groups — Logical grouping of consumers — Scale reading work — Misuse creates duplicate processing
How to Measure Pulsar (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Publish Latency | Time producer waits for ack | 95th percentile of publish time | <100 ms | Peaks on GC and disk spikes |
| M2 | End-to-End Latency | Time from publish to ACK by consumer | 95th percentile from publish to consume | <500 ms | Depends on consumer processing |
| M3 | Consumer Lag | Messages behind head per subscription | Consumer lag gauge | <10k msgs or <30s | High retention increases lag gauge |
| M4 | Under-replicated Ledgers | Count of under-replicated ledgers | Count from metadata | 0 | Can spike after failures |
| M5 | Broker CPU | Broker CPU utilization | Avg CPU across brokers | <70% | JVM GC can cause short spikes |
| M6 | Bookie Disk Usage | Storage used per bookie | Percent disk used | <70% | Sudden growth on retention change |
| M7 | Replication Backlog | Messages pending to replicate | Messages or bytes pending | Near 0 | Network partitions show high backlog |
| M8 | Publish Error Rate | Failed publishes per second | Errors / total publishes | <0.1% | Transient auth issues inflate rate |
| M9 | Consumer Error Rate | Consumer processing failures | Failures per sec | <0.5% | Deserialization errors common cause |
| M10 | Broker Restarts | Number of restarts per day | Restart count | 0 | Rolling upgrades may increase count |
| M11 | Message Loss Incidents | Confirmed lost messages | Post-incident audit | 0 | Hard to detect without end-to-end checks |
| M12 | Schema Rejection Rate | Messages rejected by schema | Rejected count / total | <0.1% | Version mismatches cause rejections |
| M13 | Connection Errors | Failed client connects | Error count | Low | Credential expiry spikes this |
| M14 | Topic Hotspots | Top topics by throughput | Top N list | Identify and redistribute | High cardinality causes hotspots |
Row Details (only if needed)
- None.
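To make M3 (consumer lag) concrete, the hedged sketch below polls per-subscription backlog from the broker's admin REST API. The endpoint shape and the msgBacklog field follow the v2 admin interface and may differ across versions; the admin address and topic are assumptions.

```python
# Hedged sketch: poll per-subscription backlog via the admin REST API.
# Requires the requests package; endpoint/field names may vary by version.
import requests

ADMIN = 'http://localhost:8080'            # assumed broker admin address
TOPIC = 'persistent/public/default/orders' # assumed topic path

stats = requests.get(f'{ADMIN}/admin/v2/{TOPIC}/stats', timeout=5).json()
for name, sub in stats.get('subscriptions', {}).items():
    backlog = sub.get('msgBacklog', 0)
    print(f'subscription={name} backlog={backlog}')
    if backlog > 10_000:  # starting target from M3 above
        print('  WARNING: backlog exceeds starting target')
```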
Best tools to measure Pulsar
Tool — Prometheus + Grafana
- What it measures for Pulsar: Broker, bookie, and proxy metrics, JVM, latency and lag.
- Best-fit environment: Kubernetes and on-prem clusters.
- Setup outline:
- Export Pulsar metrics via Prometheus exporter.
- Configure scrape jobs for brokers and bookies.
- Create Grafana dashboards for key metrics.
- Set alerting rules in Prometheus or Alertmanager.
- Strengths:
- Flexible query language and dashboarding.
- Widely adopted in cloud-native stacks.
- Limitations:
- Requires maintenance and scaling for high-cardinality metrics.
- Long-term storage needs extra components.
Tool — OpenTelemetry
- What it measures for Pulsar: Traces for publish and consume operations.
- Best-fit environment: Distributed applications needing tracing.
- Setup outline:
- Instrument producers and consumers.
- Capture spans for publish/receive operations.
- Export to tracing backend.
- Strengths:
- Standardized tracing across services.
- Useful for root-cause analysis.
- Limitations:
- Sampling decisions affect visibility.
- Client instrumentation needed.
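A minimal sketch of the producer-side instrumentation described above, assuming the opentelemetry-sdk and pulsar-client packages with an exporter configured elsewhere; span and attribute names are illustrative.

```python
# Wrap a publish in an OpenTelemetry span so publish latency appears in
# traces; assumes an SDK/exporter is configured elsewhere in the process.
import pulsar
from opentelemetry import trace

tracer = trace.get_tracer('pulsar.example')
client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('persistent://public/default/orders')

with tracer.start_as_current_span('pulsar.publish') as span:
    span.set_attribute('messaging.destination', 'orders')
    producer.send(b'payload')  # span duration approximates publish latency

client.close()
```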
Tool — Pulsar Manager / Cluster Manager UI
- What it measures for Pulsar: Cluster state, topic and subscription overview.
- Best-fit environment: Administrative operations.
- Setup outline:
- Deploy manager alongside cluster.
- Configure access control.
- Use UI for topic management and metrics.
- Strengths:
- Operational visibility into namespaces and topics.
- Easier for non-SRE users.
- Limitations:
- Not a full monitoring solution.
- Scalability varies with cluster size.
Tool — Fluentd / Log Aggregator
- What it measures for Pulsar: Client and broker logs for errors and audit.
- Best-fit environment: Centralized logging.
- Setup outline:
- Ship logs from brokers and bookies to central sink.
- Parse and tag pulsar logs.
- Create alerts for error patterns.
- Strengths:
- Rich context for debugging.
- Searchable history.
- Limitations:
- Requires log volume storage and parsing rules.
- Noise without filtering.
Tool — Stream Processing Metrics (Flink/Spark)
- What it measures for Pulsar: End-to-end processing latency and throughput in stream jobs.
- Best-fit environment: Pipelines using stream processing engines.
- Setup outline:
- Instrument job metrics and export to monitoring.
- Correlate with Pulsar metrics.
- Strengths:
- Shows pipeline health tied to Pulsar.
- Useful for backpressure detection.
- Limitations:
- Requires integration work.
- Job-level metrics not equal to broker health.
Recommended dashboards & alerts for Pulsar
Executive dashboard:
- Panels: Cluster health summary, total throughput, cross-region replication status, storage used, incident count.
- Why: High-level view for stakeholders and execs.
On-call dashboard:
- Panels: Broker CPU and GC, bookie disk usage, under-replicated ledgers, replication lag, publish error rate, consumer lag per critical namespace.
- Why: Rapid triage and remediation for SREs.
Debug dashboard:
- Panels: Per-topic throughput, per-partition latency, consumer ack rates, connection errors, recent broker logs, per-node JVM metrics.
- Why: Deep troubleshooting for engineers during incidents.
Alerting guidance:
- Page (urgent): Sustained under-replicated ledgers, replication backlog exceeding threshold for >5 minutes, a majority of broker nodes down.
- Ticket (non-urgent): Single broker restart, brief publish-error spikes lasting <5 minutes, disk usage approaching threshold.
- Burn-rate guidance: If the SLO error budget burns at >50% in 6 hours, escalate and run the remediation playbook (worked example below).
- Noise reduction: Deduplicate alerts using grouping keys, suppress known noisy windows during rolling upgrades, apply rate-limited notifications.
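The worked example below shows the burn-rate arithmetic behind that guidance; the SLO value and counts are illustrative.

```python
# Burn rate = observed error rate divided by the rate the SLO allows.
def burn_rate(failed: int, total: int, slo: float) -> float:
    """How many times faster than allowed the error budget is burning."""
    allowed_error_rate = 1.0 - slo       # e.g. 0.001 for a 99.9% SLO
    observed_error_rate = failed / total
    return observed_error_rate / allowed_error_rate

# 600 failed publishes out of 1M against a 99.9% delivery SLO:
rate = burn_rate(failed=600, total=1_000_000, slo=0.999)
print(f'burn rate = {rate:.1f}x')  # 0.6x: within budget, no escalation
```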
Implementation Guide (Step-by-step)
1) Prerequisites:
- Capacity planning for expected throughput and retention.
- Kubernetes cluster or VM pool sized for brokers and bookies.
- Storage planning for bookies with IOPS and disk durability considerations.
- Security policies: TLS, authentication, and authorization plans.
- Backup and metadata snapshot strategies.
2) Instrumentation plan:
- Export core metrics, implement tracing for producers and consumers, capture logs and audit events.
- Define SLIs and export them as metrics.
3) Data collection:
- Set up Prometheus scrapers, the logging pipeline, and tracing collectors.
- Ensure metrics retention is long enough for SLO calculation.
4) SLO design:
- Define SLOs for publish latency, end-to-end latency, and availability.
- Set error budgets and remediation thresholds.
5) Dashboards:
- Build executive, on-call, and debug dashboards.
- Include per-namespace and per-topic drilldowns.
6) Alerts & routing:
- Define alert thresholds and who to page.
- Implement dedupe and silencing during maintenance.
7) Runbooks & automation:
- Create runbooks for common failures like bookie replacement, re-replication, and partition rebalance.
- Automate scaling and repairs where possible.
8) Validation (load/chaos/game days):
- Run load tests with realistic producer patterns (see the load-producer sketch after this list).
- Perform chaos tests for broker and bookie failures.
- Run DR tests for geo-replication failover.
9) Continuous improvement:
- Weekly reviews of alerts and incidents.
- Quarterly review of retention and cost.
- Update SLOs based on observed behavior.
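For the load validation in step 8, here is a minimal synthetic-load producer sketch using non-blocking sends; it assumes the pulsar-client Python package, and the topic, message count, and payloads are illustrative.

```python
# Synthetic-load producer: non-blocking sends with batching for throughput.
import time
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/loadtest',
    batching_enabled=True,  # batch small messages for throughput
)

errors = 0
def on_sent(result, msg_id):
    global errors
    if result != pulsar.Result.Ok:
        errors += 1

start = time.time()
for i in range(100_000):
    producer.send_async(f'event-{i}'.encode(), on_sent)

producer.flush()  # wait for all outstanding sends to complete
print(f'published 100k msgs in {time.time() - start:.1f}s, errors={errors}')
client.close()
```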
Pre-production checklist:
- Capacity and retention validated with load tests.
- Security policies tested with sample clients.
- Observability stack deployed and dashboards ready.
- Runbooks created and reviewed.
Production readiness checklist:
- Alerting and escalation paths tested.
- Automatic recovery playbooks implemented.
- Backup and restore validated.
- On-call rotation and runbook access confirmed.
Incident checklist specific to Pulsar:
- Confirm scope: affected namespaces/topics.
- Check under-replicated ledgers and replication backlog.
- Verify broker and bookie node health.
- If replication lag high, throttle producers and scale replication.
- Notify stakeholders and run remediation playbook.
Use Cases of Pulsar
1) Telemetry Ingestion – Context: High-volume device telemetry. – Problem: Durable ingestion with replay for analytics. – Why Pulsar helps: High throughput, retention, and fan-out to multiple consumers. – What to measure: Ingest rate, publish latency, backlog. – Typical tools: Prometheus, Flink.
2) Event Sourcing for Microservices – Context: Business events stored as source of truth. – Problem: Need ordering and replay for rebuilds. – Why Pulsar helps: Partitioned topics and durable storage. – What to measure: End-to-end latency, retention usage. – Typical tools: Schema registry, stream processors.
3) Stream Processing Pipeline – Context: ETL from events to data lake. – Problem: Fault-tolerant pipeline with exactly-once semantics. – Why Pulsar helps: Durable source with connectors and integration. – What to measure: Processing latency, consumer failures. – Typical tools: Flink, connectors.
4) Job Queue / Worker Pool – Context: Background task processing. – Problem: Reliable distribution and retry semantics. – Why Pulsar helps: Shared subscriptions, ack policies. – What to measure: Task completion rates, redelivery counts. – Typical tools: Worker frameworks, monitoring.
5) Geo-Replicated Messaging – Context: Global application with regional failover. – Problem: Data locality and disaster recovery. – Why Pulsar helps: Built-in replication policies. – What to measure: Replication lag, cross-region latency. – Typical tools: SIEM, cross-region monitors.
6) Audit Trail and Compliance – Context: Regulatory logging and immutable audit. – Problem: Retention and access controls. – Why Pulsar helps: Durable retention and namespaces for isolation. – What to measure: Retention enforcement, schema rejections. – Typical tools: Archival offload to object storage.
7) Notification and Fan-out – Context: Multi-channel notifications from events. – Problem: Delivering same event to many consumers. – Why Pulsar helps: Efficient fan-out with multiple subscriptions. – What to measure: Delivery success rate, subscriber latency. – Typical tools: Functions for transformation.
8) ML Feature Streaming – Context: Real-time feature updates for models. – Problem: Low-latency, consistent updates across replicas. – Why Pulsar helps: Streaming guarantees and replay for feature regeneration. – What to measure: End-to-end latency, schema compatibility. – Typical tools: Feature stores, stream processors.
9) CI/CD Event Bus – Context: Build and deploy events across pipelines. – Problem: Reliable orchestration events and tracing. – Why Pulsar helps: Durable and multi-tenant topics for pipelines. – What to measure: Event delivery rates, processing failures. – Typical tools: CI/CD systems and event handlers.
10) Financial Trading Events – Context: High-frequency market events. – Problem: Ordering, durability, and replication for compliance. – Why Pulsar helps: Partitioning for ordering and geo-replication. – What to measure: Publish latency, message loss incidents. – Typical tools: Low-latency hardware and tuned bookies.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes-based Event Pipeline
Context: E-commerce platform deployed on Kubernetes processes order events.
Goal: Build a scalable event pipeline for order ingestion, fulfillment, and analytics.
Why Pulsar matters here: Provides durable ingestion, replay for analytics, and namespace isolation per environment.
Architecture / workflow: Producers (order API pods) -> Pulsar brokers (K8s statefulset) -> Bookies (statefulset with PVCs) -> Consumers: fulfillment workers and analytics Flink job.
Step-by-step implementation:
- Provision K8s cluster with storage class for bookies.
- Deploy Pulsar Operator and create cluster CR.
- Configure namespaces and auth policies for services.
- Create a partitioned topic orders with partition key customer_id (see the producer sketch after this list).
- Deploy consumers and Flink job to read from orders topic.
- Set up Prometheus scraping for metrics and Grafana dashboards.
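A minimal producer sketch for steps 4 and 5, keying orders by customer_id so per-customer ordering is preserved on a stable partition; the service URL, tenant, and namespace are assumptions for this environment.

```python
# Publish order events keyed by customer_id; the key routes all events for a
# customer to the same partition, preserving per-key ordering.
import json
import pulsar

client = pulsar.Client('pulsar://pulsar-broker.pulsar.svc:6650')  # assumed URL
producer = client.create_producer('persistent://ecommerce/prod/orders')

order = {'order_id': 'o-123', 'customer_id': 'c-42', 'status': 'created'}
producer.send(
    json.dumps(order).encode(),
    partition_key=order['customer_id'],  # stable partition per customer
)
client.close()
```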
What to measure: Publish latency, consumer lag, broker restarts, disk usage.
Tools to use and why: Kubernetes Operator for lifecycle; Prometheus/Grafana for metrics; Flink for processing.
Common pitfalls: PVC IOPS insufficient, leading to disk write latency.
Validation: Load test with synthetic order traffic and monitor SLOs.
Outcome: Scalable ingestion with replay ability for analytics.
Scenario #2 — Serverless / Managed-PaaS Integration
Context: SaaS product using managed functions for event-driven compute.
Goal: Trigger serverless functions from events and guarantee at-least-once invocation.
Why Pulsar matters here: Durable message delivery with function integrations and retries.
Architecture / workflow: Producers -> Pulsar brokers (managed or hosted) -> Functions platform (subscriber) -> Downstream services.
Step-by-step implementation:
- Use managed Pulsar offering or managed connectors to surface topics.
- Configure function triggers with appropriate subscription type.
- Enable auth tokens and TLS for secure connectivity.
- Monitor invocation failures and set a dead-letter topic for failed messages (see the consumer sketch after this list).
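A hedged consumer-side sketch of the trigger path: failed invocations are negatively acknowledged so the broker redelivers, and newer clients can additionally attach a dead-letter policy that routes exhausted messages to a DLQ topic. The token, topic, and function call are hypothetical.

```python
# Trigger-side consumer: NACK failures for redelivery (at-least-once);
# token value, topic, and invoke_function are placeholders.
import pulsar

def invoke_function(payload: bytes) -> None:
    """Hypothetical call into the FaaS runtime."""
    print('invoking function with', payload)

client = pulsar.Client(
    'pulsar://localhost:6650',
    authentication=pulsar.AuthenticationToken('REPLACE_WITH_TOKEN'),
)
consumer = client.subscribe(
    'persistent://saas/prod/events',
    subscription_name='fn-trigger',
    consumer_type=pulsar.ConsumerType.Shared,
)

msg = consumer.receive()
try:
    invoke_function(msg.data())
    consumer.acknowledge(msg)
except Exception:
    consumer.negative_acknowledge(msg)  # redelivery; DLQ after max retries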
What to measure: Invocation latency, function failure rate, DLQ size.
Tools to use and why: Managed Pulsar, Functions runtime, log aggregator.
Common pitfalls: Cold starts in serverless combine with message backlog causing timeouts.
Validation: Simulate burst traffic and verify retries to DLQ.
Outcome: Reliable serverless triggers with durable retry handling.
Scenario #3 — Incident Response and Postmortem
Context: Production incident where bookie failures caused delayed processing.
Goal: Triage and resolve replication backlog, prevent recurrence.
Why Pulsar matters here: Storage failures impact data availability and SLOs.
Architecture / workflow: Normal cluster with replication across two regions; bookie failure in primary region.
Step-by-step implementation:
- Identify under-replicated ledgers from metrics.
- Replace failed bookies and initiate re-replication.
- Throttle producers if replication backlog threatens storage.
- Create incident ticket and execute runbook for bookie replacement.
What to measure: Under-replicated ledgers, replication backlog, publish errors.
Tools to use and why: Monitoring dashboards, logs, runbooks.
Common pitfalls: Missing automated replacement caused manual delay.
Validation: Postmortem with timeline, root cause, and mitigation plan.
Outcome: Recovered replication, SLO review, automated remediation added.
Scenario #4 — Cost vs Performance Trade-off
Context: High-volume event retention causing high storage cost.
Goal: Reduce storage cost while preserving replay for 30-day window.
Why Pulsar matters here: Retention and offload policies directly impact cost.
Architecture / workflow: Pulsar cluster with offload to object storage.
Step-by-step implementation:
- Measure retention and growth trends.
- Implement ledger offload to cheaper object storage after 7 days (see the configuration sketch after this list).
- Use compaction for key-based topics to reduce size.
- Monitor read latency for offloaded segments and adjust SLOs.
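A hedged configuration sketch for the retention and offload steps via the admin REST API; the endpoint shapes follow the v2 admin interface and may differ across versions, and the admin address, namespace, and thresholds are assumptions.

```python
# Hedged sketch: set a 30-day retention window and an offload threshold via
# the admin REST API (requests package). Verify endpoints for your version.
import requests

ADMIN = 'http://localhost:8080'  # assumed admin endpoint
NS = 'analytics/events'          # assumed tenant/namespace

# Keep 30 days of data replayable, with a size cap as a safety valve.
requests.post(
    f'{ADMIN}/admin/v2/namespaces/{NS}/retention',
    json={'retentionTimeInMinutes': 30 * 24 * 60, 'retentionSizeInMB': 512000},
    timeout=10,
)

# Offload ledgers to object storage once a topic holds ~100 GB on bookies.
requests.put(
    f'{ADMIN}/admin/v2/namespaces/{NS}/offloadThreshold',
    json=100 * 1024 ** 3,
    timeout=10,
)
```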
What to measure: Retained bytes, offload success rate, read latency for offloaded data.
Tools to use and why: Storage analytics, object store metrics, Pulsar offload monitoring.
Common pitfalls: Offload retrieval latency causing consumer slowdowns.
Validation: Cost analysis and read performance test.
Outcome: Lower cost with acceptable read latency trade-off.
Common Mistakes, Anti-patterns, and Troubleshooting
Common mistakes (symptom -> root cause -> fix):
- Symptom: High publish latency -> Root cause: Broker CPU or GC -> Fix: Tune JVM, increase brokers.
- Symptom: Consumer duplicate processing -> Root cause: At-least-once semantics and missing idempotency -> Fix: Implement idempotent consumers or dedupe (see the sketch after this list).
- Symptom: Under-replicated ledgers -> Root cause: Bookie failure -> Fix: Replace bookie and run re-replication.
- Symptom: Unexpected data deletion -> Root cause: Misconfigured retention or TTL -> Fix: Review and correct retention policies.
- Symptom: Hot partition with high latency -> Root cause: High-cardinality partition keys not used correctly -> Fix: Repartition or change partition key.
- Symptom: Produce errors during upgrades -> Root cause: Rolling upgrade without client coordination -> Fix: Apply client retry/backoff and coordinate upgrade windows.
- Symptom: Metric gaps in monitoring -> Root cause: Exporter misconfiguration -> Fix: Ensure metric scrape targets and endpoints are correct.
- Symptom: Excessive storage cost -> Root cause: Long retention and no offload -> Fix: Implement offload and compaction.
- Symptom: Connection auth failures -> Root cause: Token expiry or incorrect certs -> Fix: Rotate tokens and update clients.
- Symptom: Sudden broker restarts -> Root cause: OOM/killed processes -> Fix: Increase memory and tune heap.
- Symptom: High consumer lag -> Root cause: Slow consumer processing or backpressure -> Fix: Scale consumers and optimize processing.
- Symptom: Geo-replication backlog -> Root cause: Network partition -> Fix: Restore network, increase replication throughput.
- Symptom: Topic creation abuse -> Root cause: Auto-create enabled for dev clients -> Fix: Disable auto-create and enforce namespace quotas.
- Symptom: Schema incompatibility errors -> Root cause: Uncoordinated schema changes -> Fix: Enforce schema evolution rules.
- Symptom: Noisy alerts -> Root cause: Low thresholds and no dedupe -> Fix: Tune thresholds, apply grouping.
- Symptom: Replay failures -> Root cause: Ledgers deleted after retention expiry -> Fix: Adjust retention or offload strategy.
- Symptom: Slow ledger compaction -> Root cause: Compaction running on overloaded nodes -> Fix: Schedule compaction during low traffic.
- Symptom: Incomplete backups -> Root cause: Offload failures -> Fix: Monitor offload success and retry on failure.
- Symptom: Debugging blind spots -> Root cause: No tracing or correlation IDs -> Fix: Add OpenTelemetry tracing and message IDs.
- Symptom: Security exposure -> Root cause: Open ports and weak auth -> Fix: Enforce TLS and RBAC.
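As a sketch of the idempotent-consumer fix above, the example below tracks processed message IDs (in memory here; production would use a durable store such as Redis or a database) and skips replays that at-least-once delivery can produce. The topic and handler are hypothetical.

```python
# Idempotent consumer sketch: skip already-processed messages on redelivery.
import pulsar

def handle_payment(payload: bytes) -> None:
    """Hypothetical business logic."""
    print('settling', payload)

client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe('persistent://public/default/payments', 'settle')

seen = set()  # swap for a durable store keyed by message or business ID
while True:
    msg = consumer.receive()
    key = msg.message_id().serialize()
    if key not in seen:
        handle_payment(msg.data())
        seen.add(key)
    consumer.acknowledge(msg)  # ACK duplicates too, or they redeliver forever
```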
Observability pitfalls:
- Missing tracing causes long MTTR -> add tracing.
- High-cardinality metrics overload Prometheus -> reduce cardinality or use remote write.
- No per-topic metrics hides hotspots -> enable per-topic telemetry.
- Ignoring log parsing loses error context -> centralize and parse logs.
- Not measuring backlog hides replication issues -> monitor backlog metrics.
Best Practices & Operating Model
Ownership and on-call:
- Define cluster ownership: platform team owns infrastructure; app teams own topics and consumers.
- Ensure on-call rotations include platform engineers with runbook access.
Runbooks vs playbooks:
- Runbooks: Step-by-step operational commands for routine failures.
- Playbooks: High-level decision trees for escalations and cross-team coordination.
Safe deployments:
- Canary: Deploy brokers or config changes to a subset of brokers first.
- Rollback: Keep automated rollback for deployments failing health checks.
Toil reduction and automation:
- Automate bookie replacement, scaling, and offload verification.
- Implement automatic alerts for under-replication and replication lag.
Security basics:
- Enforce TLS for client and internode traffic.
- Use token-based authentication and RBAC at namespace/topic level.
- Rotate credentials and certificates on a schedule.
Weekly/monthly routines:
- Weekly: Check health dashboards, under-replicated ledgers, storage growth.
- Monthly: Review retention policies and access logs; run disaster recovery drills.
What to review in postmortems related to Pulsar:
- Timeline of broker and bookie events.
- Metrics around publish and consume latencies.
- Root cause in storage or network.
- Actions taken and automation added.
- Impact to SLOs and customers.
Tooling & Integration Map for Pulsar
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Monitoring | Collects metrics and alerts | Prometheus Grafana | Use exporters on brokers |
| I2 | Logging | Aggregates logs | Fluentd ELK | Parse pulsar logs for errors |
| I3 | Tracing | Traces publish and consume | OpenTelemetry | Instrument producers and consumers |
| I4 | Stream Processing | Stateful processing of streams | Flink Spark | Use Pulsar connectors |
| I5 | Functions | Lightweight event compute | Pulsar Functions | Good for simple transforms |
| I6 | CI/CD | Integration tests and deployment | Helm Operators | Use operator for k8s lifecycle |
| I7 | Security | Auth and RBAC | TLS OAuth | Enforce per-namespace policies |
| I8 | Backup | Offload to object storage | S3 GCS | Validate restores regularly |
| I9 | Schema Management | Enforce message schemas | Built-in registry | Version control for schemas |
| I10 | Managed Services | Hosted Pulsar offerings | Cloud IAM | Vendor SLAs vary |
Row Details (only if needed)
- None.
Frequently Asked Questions (FAQs)
What is Pulsar used for?
Pulsar is used for durable messaging, streaming, ingestion, and multi-consumer fan-out with replay and geo-replication.
Is Pulsar a replacement for Kafka?
It can be for many streaming use cases; differences include architecture and operational model.
Does Pulsar guarantee exactly-once delivery?
Not inherently; it provides at-least-once semantics. Exactly-once often requires upstream and downstream idempotence or processing frameworks.
How does Pulsar store messages?
Messages are persisted to ledger segments in bookies; metadata tracks ledger locations.
Can Pulsar run on Kubernetes?
Yes; operators and Helm charts support Kubernetes deployments.
Is Pulsar multi-tenant?
Yes; namespaces and tenant constructs provide multi-tenancy.
What are common causes of message loss?
Misconfigured retention, under-replicated ledgers, and improper offload setups.
How to handle schema evolution?
Use Pulsar’s schema registry and enforce compatibility rules.
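As a hedged illustration, the Python client ships schema support in pulsar.schema (the Avro support may require the client's avro extra); the sketch below declares a record type so the broker-side registry can enforce compatibility on publish. Topic and field names are illustrative.

```python
# Schema-checked producer sketch using the client's built-in schema support.
import pulsar
from pulsar.schema import AvroSchema, Record, String, Integer

class OrderEvent(Record):
    order_id = String()
    amount_cents = Integer()

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/orders',
    schema=AvroSchema(OrderEvent),  # registry checks compatibility
)
producer.send(OrderEvent(order_id='o-1', amount_cents=1999))
client.close()
```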
How to secure Pulsar?
Use TLS, token-based auth, and RBAC for namespace/topic access.
How to scale Pulsar for high throughput?
Increase brokers and partitions, tune bookie storage and network, and partition topics appropriately.
Can Pulsar be managed as a service?
Yes; managed Pulsar offerings exist but SLA and features vary by vendor.
How to monitor Pulsar health?
Collect broker, bookie, and replication metrics; monitor under-replicated ledgers and consumer lag.
What is a bookie?
A storage node in the BookKeeper model that persists message ledgers.
How to prevent topic sprawl?
Disable auto-creation or enforce quotas per namespace.
What happens during metadata store failure?
Topics may become inaccessible; restore from metadata backups.
How to test Pulsar in staging?
Run load tests replicating production traffic patterns and perform chaos tests on brokers/bookies.
How to keep costs under control?
Use offload, compaction, reasonable retention, and monitor stored bytes.
What are best SLOs for Pulsar?
Start with publish latency SLOs and consumer lag targets; tune per-application needs.
Conclusion
Apache Pulsar is a robust, cloud-native streaming and messaging solution that fits modern event-driven architectures when durability, replay, and multi-region replication matter. Operating Pulsar well requires investment in observability, automation, and SRE practices to manage storage and replication complexity.
Next 7 days plan:
- Day 1: Run capacity and retention estimation for target workloads.
- Day 2: Deploy a dev Pulsar cluster and enable Prometheus metrics.
- Day 3: Create sample topics and instrument a producer and consumer with tracing.
- Day 4: Implement SLOs and dashboards for publish latency and consumer lag.
- Day 5: Run a load test and observe bottlenecks.
- Day 6: Create runbooks for core failures and test a failover.
- Day 7: Review cost and retention policy; draft production readiness checklist.
Appendix — Pulsar Keyword Cluster (SEO)
Primary keywords:
- pulsar messaging
- apache pulsar
- pulsar streaming
- pulsar architecture
- pulsar tutorial
Secondary keywords:
- pulsar vs kafka
- pulsar broker
- pulsar bookie
- pulsar topics
- pulsar geo-replication
Long-tail questions:
- how to deploy pulsar on kubernetes
- pulsar best practices for production
- how does pulsar store messages
- pulsar monitoring and observability
- pulsar schema registry usage
Related terminology:
- bookkeeper ledger
- partitioned topic
- namespace quotas
- pulsar functions
- retention policy
- replication backlog
- consumer lag
- publish latency
- under-replicated ledger
- offload to object storage
- message id tracking
- schema evolution
- token authentication
- TLS encryption
- Prometheus Grafana
- OpenTelemetry tracing
- Flink connector
- serverless triggers
- multi-tenant isolation
- JVM tuning
- ledger compaction
- runbook automation
- canary deployments
- chaos testing
- cost optimization
- topic partitioning
- hot partition mitigation
- DLQ dead-letter topic
- message deduplication
- producer batching
- subscription types
- exclusive subscription
- shared subscription
- failover subscription
- schema validation
- data replay
- audit trail
- backup and restore
- storage offload
- message TTL
- metadata store
- operator lifecycle
- managed pulsar service
- event sourcing with pulsar
- real-time analytics
- ingest pipeline
- notifications and fanout
- feature streaming
- CI CD event bus