Quick Definition
Pub Sub is a messaging pattern where publishers send messages to topics and subscribers receive those messages asynchronously. Analogy: a newspaper press prints editions, and readers subscribe only to the sections they care about. Formal: a decoupled, usually brokered messaging model enabling asynchronous one-to-many (fan-out) communication across distributed systems.
What is Pub Sub?
Pub Sub (publish–subscribe) is a communication model for asynchronous event distribution. Publishers emit events to named topics without knowing consumers. Subscribers express interest in topics and receive events they care about. It is NOT point-to-point queuing (though many systems support both). It is not a database replacement; it’s event transport.
Key properties and constraints:
- Decoupling: producers and consumers are separated in time and space.
- Delivery semantics: at-most-once, at-least-once, or exactly-once (varies by implementation).
- Ordering: topic-wide ordering is usually weak; per-key ordering can be guaranteed within a partition (see the sketch after this list).
- Retention: messages can be ephemeral or retained for replays.
- Scaling: horizontal scaling via partitions, shards, or consumer groups.
- Security: ACLs, encryption, and auth are critical in multi-tenant systems.
- Latency vs durability trade-offs: tuning for low latency often means relaxing durability or retention guarantees.
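To make the ordering and scaling bullets concrete, here is a minimal sketch of key-based partitioning, the mechanism behind per-key ordering. The hash function is an assumption for illustration (Kafka, for example, uses murmur2), but the principle (same key, same partition, stable relative order) is general.

```python
import hashlib

NUM_PARTITIONS = 6  # assumed partition count for illustration

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition: the same key always lands on the same partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one customer share a partition, so their relative order is preserved;
# events for different customers may interleave across partitions.
for event_id, customer in enumerate(["alice", "bob", "alice", "carol", "alice"]):
    print(f"event {event_id} key={customer} -> partition {partition_for(customer)}")
```

Ordering is therefore per key (per partition), not global: choose partition keys that match the ordering your consumers actually require.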
Where it fits in modern cloud/SRE workflows:
- Event-driven microservices coordination.
- Decoupling ingress from processing in Kubernetes and serverless.
- Buffering bursts to protect downstream services.
- Observability and control plane events in CI/CD and incident response.
- AI pipelines for data ingestion, feature updates, and model signals.
Diagram description (text-only):
- Publishers -> Topic (broker or cluster) -> Partitioned storage -> Subscription broker -> Delivery to subscribers/consumers -> Acknowledgment path back to broker.
Pub Sub in one sentence
A messaging architecture that decouples event producers from consumers by routing published messages to interested subscribers through topics and brokers.
Pub Sub vs related terms
| ID | Term | How it differs from Pub Sub | Common confusion |
|---|---|---|---|
| T1 | Message Queue | Point-to-point delivery to a queue consumer | Confused with fan-out behavior |
| T2 | Event Bus | Broader ecosystem term for event routing and tooling | Used interchangeably with Pub Sub |
| T3 | Stream Processing | Focus on continuous computation over message streams | People think Pub Sub does processing |
| T4 | Brokerless Pub Sub | Uses direct peer routing without central broker | Assumed to be more reliable always |
| T5 | Notification System | Often one-way alerts with no replay | Thought to be full-featured Pub Sub |
| T6 | Event Sourcing | Persisting state changes as events in an append log | Mistaken for generic Pub Sub |
| T7 | CDC (Change Data Capture) | Source-specific streaming of DB changes | Confused as a replacement for Pub Sub |
| T8 | Webhook | HTTP callback for events | Considered the same as Pub Sub but is push-only |
| T9 | Batching Queue | Groups messages for throughput | Assumed to be real-time Pub Sub |
| T10 | MQTT Broker | Lightweight protocol for IoT Pub Sub | Thought to be identical to cloud Pub Sub |
Why does Pub Sub matter?
Business impact:
- Revenue: Enables reliable ingestion of customer events and transactions; prevents data loss that can drive revenue leakage.
- Trust: Ensures consistent user experience during traffic spikes; reduces missed notifications and SLA breaches.
- Risk: Poorly designed Pub Sub can cause data duplication, stale processing, or cascading failures that damage reputation.
Engineering impact:
- Incident reduction: Buffers and backpressure reduce downstream outages.
- Velocity: Teams can develop independently with minimal coupling.
- Complexity: Introduces distributed system concerns requiring investment in observability and SLOs.
SRE framing:
- SLIs: Delivery latency, success rate, processing backlog.
- SLOs: e.g., 99.9% of messages delivered within 500 ms for critical topics.
- Error budget: Drives rate-limiting and feature rollout decisions.
- Toil: Manual replay and debugging of events are common toil sources.
- On-call: Alerting must separate broker health vs consumer lag to reduce noise.
What breaks in production (realistic examples):
- Consumer lag spikes causing business process delays (e.g., payments delayed).
- Duplicate deliveries leading to double billing when at-least-once semantics meet consumers that are not idempotent.
- Topic partition leader failure causing temporary ordering violations.
- Misconfigured retention leading to data loss and inability to replay for recovery.
- Uncontrolled fan-out during a traffic storm overwhelms third-party APIs.
Where is Pub Sub used?
| ID | Layer/Area | How Pub Sub appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge | Event ingestion from devices and gateways | Ingress rate, error rate, auth failures | MQTT brokers, Kafka edge bridges |
| L2 | Network | Event routing between datacenters | Replication lag, throughput, RTT | Network proxies, message routers |
| L3 | Service | Microservice integration via events | Consumer lag, process time, retries | Kafka, NATS, Pulsar |
| L4 | Application | UI updates, notifications, websockets | Delivery latency, drop rates | Managed Pub Sub, Webhook gateways |
| L5 | Data | ETL and analytics pipelines | Retention usage, replay counts | Kafka, cloud Pub Sub, CDC tools |
| L6 | Cloud infra | Control plane events and autoscaling | Event volume, throttles | Cloud-native Pub Sub, Event Grid |
| L7 | CI/CD | Build triggers and pipeline events | Event success, latency | Pub Sub for triggers, message queues |
| L8 | Observability | Telemetry and trace enrichment | Sampling rate, delivery success | Telemetry pipelines, message brokers |
| L9 | Security | Audit event streaming and alerts | Log delivery, integrity checks | Event brokers, SIEM ingestion |
| L10 | Serverless | Function triggers and fan-out | Invocation rate, cold starts | Managed Pub Sub, EventBridge, Cloud Pub/Sub |
When should you use Pub Sub?
When necessary:
- You need decoupling between producers and consumers.
- High fan-out or broadcast distribution is required.
- Buffering or smoothing bursts protects downstream services.
- Replays and event retention for recovery or analytics are required.
When optional:
- Simple synchronous RPC suffices (low latency, tight coupling).
- Low throughput and simple workflows where HTTP webhooks are adequate.
When NOT to use / overuse it:
- For transactional strong consistency across services without compensating patterns.
- When a single-authority datastore solves consistency better.
- If you lack observability or staffing to operate a distributed event platform.
Decision checklist:
- If asynchronous communication and resilience to spikes are required -> use Pub Sub.
- If strict single-writer consistency is required -> prefer direct database or consensus.
- If event ordering across many producers matters -> confirm partitioning and ordering guarantees.
- If consumers require full event replay and audit -> choose a retained stream with adequate retention.
Maturity ladder:
- Beginner: Managed cloud Pub Sub with simple topics and a few consumers.
- Intermediate: Partitioned topics, consumer groups, replay strategies, and DLQs.
- Advanced: Geo-replication, multi-tenancy, exactly-once semantics, schema governance, and automated scaling.
How does Pub Sub work?
Components and workflow:
- Publisher: produces messages and sends to a topic or stream.
- Broker/Cluster: receives messages, persists them according to retention and replication rules.
- Topic/Partition: logical channel for events; partitions provide parallelism.
- Subscription/Consumer Group: tracks which consumers have processed which messages.
- Delivery mechanism: push (server pushes to endpoint) or pull (consumer polls).
- Acknowledgment: consumers ack processed messages to remove or advance offsets.
- Dead-letter queue (DLQ): stores messages that repeatedly fail processing.
Data flow and lifecycle:
- Producer serializes event and publishes to topic.
- Broker writes to partition log and replicates to peers.
- Broker either pushes to subscriber endpoints or records for consumers to pull.
- Consumer receives message, processes, and acknowledges.
- Broker advances the subscription offset or removes the message, as commit and retention policy dictate.
- Failures trigger retry, backoff, or DLQ delivery (a toy sketch of this lifecycle follows).
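A toy, in-memory sketch of this lifecycle: append-only log, pull delivery, offset advance on ack, DLQ after retries. Names and the retry budget are illustrative and do not match any specific broker's API.

```python
from collections import defaultdict

MAX_ATTEMPTS = 3  # assumed retry budget before dead-lettering

class ToyBroker:
    def __init__(self):
        self.log = defaultdict(list)       # topic -> append-only message log
        self.offsets = defaultdict(int)    # (topic, group) -> committed offset
        self.dlq = defaultdict(list)       # topic -> dead-lettered messages

    def publish(self, topic, message):
        self.log[topic].append(message)    # broker persists to the partition log

    def pull(self, topic, group):
        offset = self.offsets[(topic, group)]
        if offset < len(self.log[topic]):
            return offset, self.log[topic][offset]
        return None                        # nothing new to deliver

    def ack(self, topic, group, offset):
        self.offsets[(topic, group)] = offset + 1  # advance committed offset

def consume(broker, topic, group, handler):
    while (delivery := broker.pull(topic, group)) is not None:
        offset, message = delivery
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                handler(message)
                broker.ack(topic, group, offset)
                break
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    broker.dlq[topic].append(message)  # poison message -> DLQ
                    broker.ack(topic, group, offset)   # skip it and move on

broker = ToyBroker()
broker.publish("orders", {"id": 1})
broker.publish("orders", {"id": 2})
consume(broker, "orders", "billing", lambda m: print("processed", m))
```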
Edge cases and failure modes:
- Broker node crash during commit causing offset gaps.
- Network partitions causing split-brain or duplicate delivery.
- Schema evolution causing consumer parsing errors.
- Time-skewed producers emitting events with out-of-order timestamps.
Typical architecture patterns for Pub Sub
- Fan-out broadcast: One publisher, many independent subscribers for notifications and UI updates.
- Consumer groups for scaling: Multiple consumers share a partitioned topic with at-least-once processing (see the client sketch after this list).
- Pipeline chaining: Events flow topic-to-topic through processing stages (ETL, enrichment).
- CQRS/Event sourcing: Events are the primary source of truth and fed into projections.
- Edge ingestion with regional brokers: Local brokers at edge ingest and replicate to central topics.
- Serverless event triggers: Managed Pub Sub triggers functions in response to messages.
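A sketch of the consumer-group pattern using the kafka-python client, assuming a broker at localhost:9092 and a topic named image-tasks (both assumptions): consumers sharing a group_id split the topic's partitions between them, while a consumer with a different group_id would receive its own full copy of the stream (fan-out).

```python
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

BOOTSTRAP = "localhost:9092"  # assumed broker address

def process(task: dict) -> None:
    print("processing", task)  # stand-in for real work; must be idempotent under at-least-once

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("image-tasks", {"image_id": "img-123"})
producer.flush()

# Run several copies of this consumer with the same group_id to scale out:
# each instance is assigned a disjoint subset of the topic's partitions.
consumer = KafkaConsumer(
    "image-tasks",
    bootstrap_servers=BOOTSTRAP,
    group_id="image-workers",     # same group -> work is shared, not duplicated
    enable_auto_commit=False,     # commit only after successful processing
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    process(record.value)
    consumer.commit()             # advance this group's committed offset
```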
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Consumer lag spike | Growing backlog | Slow consumer or outage | Auto-scale consumers and backpressure | Lag metric rising |
| F2 | Duplicate messages | Duplicate processing results | At-least-once delivery, idempotency missing | Implement idempotency keys or de-dup | Duplicate event count |
| F3 | Message loss | Missing business outcomes | Retention misconfig or ack bug | Increase retention and verify acks | Consumer offset holes |
| F4 | Broker CPU/IO overload | High latency and errors | Hot partition or insufficient capacity | Rebalance partitions and scale cluster | Broker latency and CPU |
| F5 | Ordering violation | Out-of-order events consumed | Partitioning mismatch | Ensure partition key aligns with ordering need | Ordering error rate |
| F6 | Backpressure cascade | Downstream timeouts | Unbounded fan-out and sync calls | Add rate limits and buffering | Downstream error spike |
| F7 | Schema mismatch | Deserialization errors | Schema evolution without compatibility | Enforce schema registry and compatibility | Schema error logs |
| F8 | Security breach | Unauthorized access to topics | Misconfigured ACLs or keys leaked | Rotate keys and apply least privilege | Unauthorized access audit |
| F9 | DLQ explosion | Large DLQ growth | Bug or malformed messages | Monitor DLQ and implement retry policies | DLQ size metric |
| F10 | Cross-region lag | Replication delays | Bandwidth or replication config | Tune replication and bandwidth | Replication lag metric |
Key Concepts, Keywords & Terminology for Pub Sub
Glossary. Each entry: term — definition — why it matters — common pitfall
- Topic — Named channel for messages — central routing unit — confusing with queue
- Partition — Subdivision of a topic for scale — enables parallelism — uneven key distribution
- Offset — Position in a partition — tracks progress — miscommits cause reprocessing
- Consumer group — Set of consumers sharing work — scaling and fault tolerance — misconfiguration leads to duplicates
- Broker — Server handling message storage and delivery — core component — single point of failure if unreplicated
- Publisher — Produces messages — entrypoint — backpressure handling often ignored
- Subscriber — Receives messages — processing element — failure handling often weak
- Acknowledgment (ack) — Signals successful processing — prevents redelivery — incorrect ack semantics cause loss or redelivery
- Nack — Negative ack for retry — enables retries — can cause thundering retries
- Retention — How long messages are kept — enables replay — retention too short causes data loss
- Dead-letter queue (DLQ) — Stores failed messages — debug and recover — can fill rapidly
- Delivery semantics — At-most-once/At-least-once/Exactly-once — critical for correctness — exactly-once costs more
- Exactly-once — Guarantees single processing — simplifies semantics — often limited and expensive
- At-least-once — Ensures delivery at least once — resilient — requires idempotency
- At-most-once — May drop messages — low duplication — risks loss
- Ordering guarantee — Message order preservation — required for some workflows — partitioning complicates it
- Schema registry — Central schema store — prevents decoding errors — lack causes runtime failures
- Serialization — Message encoding e.g., JSON, Avro — affects size and speed — human-readable can be costly
- Compression — Reduce bandwidth and cost — improves throughput — adds CPU overhead
- Fan-out — One-to-many delivery — supports notifications — can overwhelm consumers
- Backpressure — Mechanism to slow producers — prevents overload — often unimplemented
- Rate limiting — Throttle events — protects downstream — can increase latency
- Rebalance — Reassign partitions among consumers — supports scaling — causes transient downtime
- Leader election — Partition leader selection — ensures coordination — leader loss causes latency
- Replication factor — Number of copies of data — durability — increases storage and network cost
- Throughput — Messages per second — capacity measure — ignoring peaks causes outages
- Latency — Time to deliver message — user experience impact — not all systems optimize for low latency
- Exactly-once transactions — Atomic publish and consume — simplifies logic — requires specialized support
- Idempotency — Ability to apply operation multiple times safely — avoids duplicates — requires design effort
- Checkpoint — Saved consumer progress — faster recovery — stale checkpoints cause replay
- Compaction — Retention by key, keeping only the latest value — useful for stateful data — can surprise consumers expecting full history
- Log-based storage — Append-only design — efficient for replay — needs cleanup strategies
- Geo-replication — Cross-region replication — disaster recovery — introduces consistency trade-offs
- Brokerless — Peer-to-peer Pub Sub — removes the central point of failure — more complex coordination
- Multitenancy — Hosting multiple tenants on one cluster — cost-efficient — risk of noisy neighbor
- Observability — Metrics, traces, logs for Pub Sub — operationally vital — often under-instrumented
- Schema evolution — Changes to message structure — supports forward/backward compatibility — breaking changes hurt consumers
- Consumer lag — Unprocessed messages count — indicates degradation — alarms often too noisy
- Payload size limit — Max message size — affects design — large messages need object storage
- Delivery mode — Push vs pull — affects flow control — pull gives more consumer control
- Broker quota — Resource caps per tenant — prevents abuse — poorly set quotas block valid workloads
- TLS/auth — Encryption and identity — secures channels — misconfiguration breaks connectivity
- Event sourcing — Persisting events as source of truth — powerful for audit — adds complexity to writes
- CDC (Change Data Capture) — streams DB changes as events — integrates legacy systems — schema drift risk
- Message header — Metadata for routing — enables filtering — inconsistent use creates coupling
How to Measure Pub Sub (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Delivery success rate | Percent messages delivered | delivered / published | 99.99% for critical | Counting duplicates skews rate |
| M2 | End-to-end latency | Time from publish to ack | 95th percentile of delivery time | p95 < 500ms for low-latency | Clock skew affects measurement |
| M3 | Consumer lag | Unprocessed message count | latest offset – committed offset | Keep near zero for realtime | Spikes during deploys normal |
| M4 | Throughput | Messages/sec | publish rate over window | Baseline + 3x headroom | Bursts require burst metrics |
| M5 | DLQ rate | Failures moved to DLQ | DLQ messages / total | Aim for near zero | DLQs used for valid debugging |
| M6 | Replay rate | Frequency of replays | replay events / time | Low in steady state | Flaky consumers increase replays |
| M7 | Broker CPU usage | Load on brokers | CPU percent | <70% steady state | Short spikes can be normal |
| M8 | Replication lag | Time lag between replicas | replica offset lag | < 1s for critical topics | Network issues inflate this |
| M9 | Schema error rate | Deserialization failures | schema errors / total | Target 0.01% | Evolving producers cause bursts |
| M10 | Auth failures | Unauthorized attempts | auth fails / attempts | Near zero | Misconfigurations cause alerts |
| M11 | Message size distribution | Impacts throughput and cost | histogram of sizes | Keep median small | Large tails increase cost |
| M12 | Consumer restart rate | Instability of consumers | restarts / time | Low steady rate | Deployments cause spikes |
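The lag formula in M3 can also be computed from outside the consumer. Here is a sketch with kafka-python, assuming a broker at localhost:9092 and the topic/group names from earlier examples; managed services expose the equivalent "newest offset minus committed offset" figure as a built-in metric.

```python
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

def consumer_lag(topic: str, group_id: str, bootstrap: str = "localhost:9092") -> int:
    """Total lag for a group on a topic: sum over partitions of (end offset - committed offset)."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap, group_id=group_id)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)  # latest offset per partition
    lag = 0
    for tp in partitions:
        committed = consumer.committed(tp) or 0     # None if the group never committed
        lag += end_offsets[tp] - committed
    consumer.close()
    return lag

print(consumer_lag("image-tasks", "image-workers"))
```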
Best tools to measure Pub Sub
Tool — Prometheus + Grafana
- What it measures for Pub Sub: Broker and consumer metrics, latency, lag, resource usage.
- Best-fit environment: Kubernetes, self-managed clusters.
- Setup outline:
- Export broker metrics via exporters or built-in endpoints.
- Instrument consumers for offsets and processing time.
- Scrape metrics with Prometheus and visualize in Grafana.
- Add recording rules for SLIs.
- Configure alerting via Alertmanager.
- Strengths:
- Flexible, wide adoption.
- Good query and dashboard capabilities.
- Limitations:
- Storage retention trade-offs and cardinality issues.
- Requires maintenance and scaling.
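A minimal sketch of the "instrument consumers" step above, using the prometheus_client library; the metric names, labels, scrape port, and the randomized demo values are all assumptions to adapt to your conventions.

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server  # pip install prometheus-client

PROCESSED = Counter("pubsub_messages_processed_total", "Messages processed", ["topic"])
LAG = Gauge("pubsub_consumer_lag", "Committed-offset lag", ["topic", "group"])
LATENCY = Histogram("pubsub_processing_seconds", "Per-message processing time", ["topic"])

start_http_server(9102)  # expose /metrics for Prometheus to scrape (port is an assumption)

while True:
    with LATENCY.labels(topic="image-tasks").time():
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for real message processing
    PROCESSED.labels(topic="image-tasks").inc()
    # In a real consumer, feed this gauge from the broker's offsets rather than random numbers.
    LAG.labels(topic="image-tasks", group="image-workers").set(random.randint(0, 50))
```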
Tool — Managed Cloud Observability (varies by vendor)
- What it measures for Pub Sub: Integrated metrics, traces, and logs for cloud managed brokers.
- Best-fit environment: Cloud-managed Pub Sub services.
- Setup outline:
- Enable provider’s monitoring for Pub Sub.
- Configure IAM roles to collect metrics.
- Integrate with alerting and incident platforms.
- Strengths:
- Low setup overhead.
- Tight integration with managed services.
- Limitations:
- Vendor lock-in.
- Less customization.
Tool — OpenTelemetry + Tracing backend
- What it measures for Pub Sub: Trace spans across publish and consume boundaries.
- Best-fit environment: Distributed systems requiring end-to-end tracing.
- Setup outline:
- Instrument producers and consumers with OpenTelemetry.
- Capture publish and process spans.
- Export traces to backend and correlate with message IDs.
- Strengths:
- End-to-end visibility across services.
- Context propagation across async boundaries.
- Limitations:
- High cardinality and sampling decisions matter.
- Some brokers need custom instrumentation.
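A sketch of context propagation across the async boundary: the publisher injects W3C trace headers into message metadata and the consumer extracts them, using the opentelemetry-api package. The in-process "broker hop" and the message shape are stand-ins; with no SDK configured the spans are no-ops but the wiring is the same.

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject  # pip install opentelemetry-api opentelemetry-sdk

tracer = trace.get_tracer("pubsub-example")

def handle(payload):
    print("handled", payload)

def consume(message):
    # Rebuild the publisher's trace context from the message headers,
    # so the processing span is parented to the publish span.
    ctx = extract(message["headers"])
    with tracer.start_as_current_span("process", context=ctx):
        handle(message["payload"])

def publish(topic, payload):
    with tracer.start_as_current_span("publish"):
        headers = {}
        inject(headers)  # writes the W3C traceparent into the carrier dict
        consume({"headers": headers, "payload": payload})  # stand-in for a real broker hop

publish("invoices", {"invoice_id": "inv-42"})
```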
Tool — Kafka Connect + Connectors
- What it measures for Pub Sub: Throughput, connector lag, sink health.
- Best-fit environment: Kafka-centric data pipelines.
- Setup outline:
- Deploy connectors for sinks and sources.
- Monitor task failures and connector metrics.
- Configure offset commit and error handling.
- Strengths:
- Simplifies integrations.
- Scales well for data movement.
- Limitations:
- Connector quality varies.
- Operational overhead for many connectors.
Tool — Cloud Cost & Usage Monitoring
- What it measures for Pub Sub: Cost per topic, ingress/egress billing.
- Best-fit environment: Managed cloud Pub Sub or broker billing.
- Setup outline:
- Tag topics and monitor billing per tag.
- Correlate message volume with cost.
- Set budget alerts for spikes.
- Strengths:
- Controls cost surprises.
- Helps justify optimizations.
- Limitations:
- Attribution can be approximate.
- Not all clouds provide fine-grained per-topic cost.
Recommended dashboards & alerts for Pub Sub
Executive dashboard:
- Panels: Overall delivery success rate, total throughput, DLQ volume trend, major topic health.
- Why: High-level indicators for business stakeholders and reliability.
On-call dashboard:
- Panels: Consumer lag per critical topic, broker CPU/memory, top DLQ messages, top error types.
- Why: Rapid diagnosis and triage during incidents.
Debug dashboard:
- Panels: Per-partition latency, recent failed message samples, schema error traces, consumer offsets timeline.
- Why: Deep dive to troubleshoot correctness and ordering.
Alerting guidance:
- Page vs ticket:
- Page for: Broker down, replication stall, sustained consumer lag for critical topics.
- Ticket for: Minor DLQ growth or single consumer failure with automated retries.
- Burn-rate guidance:
- Use error-budget burn rate to gate escalations and deployments; alert when the burn rate stays above a threshold for a sustained window, scaled to the SLO (see the sketch after this section).
- Noise reduction tactics:
- Deduplicate identical alerts, group by topic or cluster, suppress during known maintenance windows.
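A sketch of the burn-rate arithmetic behind that guidance; the SLO and sample numbers are assumptions.

```python
SLO = 0.999                      # assumed SLO: 99.9% delivery success
ERROR_BUDGET = 1 - SLO           # 0.1% of messages may fail per SLO window

def burn_rate(failed: int, total: int) -> float:
    """How many times faster than budgeted we are consuming the error budget."""
    observed_error_rate = failed / total
    return observed_error_rate / ERROR_BUDGET

# 60 failures out of 20,000 messages -> 0.3% error rate -> burn rate 3.0:
# at this pace the window's budget is exhausted in a third of the window.
print(burn_rate(failed=60, total=20_000))  # 3.0
```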
Implementation Guide (Step-by-step)
1) Prerequisites:
- Define topics and ownership.
- Decide delivery semantics and retention.
- Design partition keys and the schema contract.
- Ensure a security and IAM plan.
2) Instrumentation plan:
- Emit publish and consume metrics.
- Attach message IDs and trace context.
- Record payload size and schema version.
3) Data collection:
- Centralize metrics, traces, and logs.
- Store message samples securely for debugging.
- Configure DLQs and monitoring.
4) SLO design:
- Define SLIs: delivery success and latency.
- Set SLOs: e.g., p95 delivery < 500 ms, 99.9% success.
- Define error budgets and escalation paths.
5) Dashboards:
- Build exec, on-call, and debug dashboards.
- Include historical trends and capacity-planning panels.
6) Alerts & routing:
- Define alert thresholds for lag, latency, and DLQ rates.
- Route critical alerts to paging and others to tickets.
- Add runbook links to alerts.
7) Runbooks & automation:
- Runbooks for consumer restarts, DLQ triage, and scaling brokers.
- Automation for scale-up, rebalances, and retention changes.
8) Validation (load/chaos/game days):
- Run load tests for peak throughput.
- Inject failures: broker restart, network partition, consumer crash.
- Run business game days focusing on recovery and replay.
9) Continuous improvement:
- Review incidents and refine SLOs.
- Automate recurring runbook steps.
- Prune unused topics and tune retention.
Pre-production checklist:
- Topics defined with retention and partition counts.
- Schema registered and compatibility checks in place.
- Test producers and consumers with simulated loads.
- Observability and alerts configured.
Production readiness checklist:
- Capacity planning and autoscaling configured.
- ACLs and encryption at rest/in transit enabled.
- DLQs and retry policies enabled.
- Runbooks accessible and tested.
Incident checklist specific to Pub Sub:
- Check broker cluster health and leader state.
- Inspect consumer lag and restart events.
- Verify DLQ growth and sample failed messages.
- If needed, scale consumers or brokers and execute runbook steps.
- Record event IDs for postmortem.
Use Cases of Pub Sub
- Real-time analytics ingestion – Context: High-volume clickstream events. – Problem: Need scalable ingestion and replay for analytics. – Why Pub Sub helps: Buffers bursts and retains data for batch/stream processing. – What to measure: Throughput, retention utilization, consumer lag. – Typical tools: Kafka, managed cloud Pub Sub.
- Notification delivery – Context: Multi-channel user notifications. – Problem: Many recipients and channels with independent retries. – Why Pub Sub helps: Fan-out to channel adapters and decouples processing. – What to measure: Delivery success by channel, DLQ rates. – Typical tools: Pub Sub + worker pools.
- Microservices integration – Context: Domain events across services. – Problem: Avoid synchronous tight coupling. – Why Pub Sub helps: Loose coupling and independent deployment. – What to measure: Event schema errors, end-to-end latency. – Typical tools: NATS, Kafka, cloud Pub Sub.
- Serverless triggers – Context: Event-driven functions for processing. – Problem: Scale functions without writing glue code. – Why Pub Sub helps: Managed triggers and auto-scaling. – What to measure: Invocation rate, function cold starts. – Typical tools: Managed cloud Pub Sub or EventBridge.
- ETL pipelines and CDC – Context: Capture DB changes into analytics store. – Problem: Reliable and ordered change propagation. – Why Pub Sub helps: Stream durability and ordering per key. – What to measure: CDC lag, schema compatibility. – Typical tools: Kafka Connect, Debezium.
- IoT telemetry at the edge – Context: Massive device telemetry ingestion. – Problem: Intermittent connectivity and diverse devices. – Why Pub Sub helps: Local buffering and replication. – What to measure: Ingress rate, auth failures, replication lag. – Typical tools: MQTT, Kafka edge bridges.
- Audit and security event streaming – Context: Centralized security analysis. – Problem: Multiple services generating audit logs. – Why Pub Sub helps: Central pipeline for SIEM and analytics. – What to measure: Delivery success, event integrity. – Typical tools: Pub Sub + log forwarders.
- CI/CD pipeline orchestration – Context: Build and deploy event triggers. – Problem: Decouple pipeline stages and provide retries. – Why Pub Sub helps: Event triggers and fan-out to parallel tasks. – What to measure: Trigger latency, failed pipelines. – Typical tools: Pub Sub + orchestration services.
- Feature flag and config propagation – Context: Dynamic feature rollout. – Problem: Reach many services reliably and instantly. – Why Pub Sub helps: Broadcast updates and support rollback. – What to measure: Propagation latency, version mismatch rate. – Typical tools: Pub Sub with service listeners.
- Model inference and AI pipelines – Context: Real-time feature updates and prediction streaming. – Problem: High-throughput, low-latency events to models. – Why Pub Sub helps: Buffering, replay for retraining, fan-out to multiple models. – What to measure: Prediction latency, input data completeness. – Typical tools: Kafka, Pulsar, managed Pub Sub.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes event-driven worker pool
Context: A Kubernetes cluster processes image uploads from users.
Goal: Scale workers and avoid dropping messages during spikes.
Why Pub Sub matters here: Decouples API pods from CPU-intensive image processors and allows horizontal scaling.
Architecture / workflow: API -> Publish to topic (Kafka/NATS) -> Consumer group of workers in K8s -> Process images -> Ack.
Step-by-step implementation:
- Deploy Kafka cluster or use managed Kafka.
- Create topic with sufficient partitions.
- Implement producers in API pods to publish image tasks.
- Deploy worker Deployment with autoscaling based on consumer lag.
- Use DLQ for failed images after retries.
- Instrument metrics and dashboards (a lag-based scaling sketch follows this list).
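A sketch of the lag-based scaling decision from the steps above; in production a tool like KEDA's Kafka scaler performs this calculation for you. The target lag per replica and the bounds are assumptions.

```python
import math

LAG_PER_REPLICA = 1000   # assumed target: one worker per 1,000 pending messages
MIN_REPLICAS, MAX_REPLICAS = 2, 30

def desired_replicas(total_lag: int) -> int:
    """Scale workers proportionally to backlog, clamped to sane bounds."""
    wanted = math.ceil(total_lag / LAG_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))

print(desired_replicas(0))       # 2  (floor keeps warm capacity)
print(desired_replicas(12_500))  # 13
print(desired_replicas(90_000))  # 30 (ceiling protects the cluster)
```

Replicas beyond the topic's partition count sit idle, so provision partitions with headroom before relying on wide scale-out.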
What to measure: Consumer lag, processing latency, DLQ rate, pod CPU.
Tools to use and why: Kafka for throughput; KEDA for autoscaling; Prometheus/Grafana for metrics.
Common pitfalls: Wrong partition key causing hotspots.
Validation: Load test with simulated uploads and monitor lag.
Outcome: Smooth handling of spikes and predictable throughput.
Scenario #2 — Serverless invoice processing (serverless/managed-PaaS)
Context: SaaS platform generates invoices and needs asynchronous PDF generation.
Goal: Trigger serverless functions reliably and scale with traffic.
Why Pub Sub matters here: Managed Pub Sub triggers functions without provisioning compute.
Architecture / workflow: Billing service -> Cloud Pub Sub topic -> Function triggered -> Generate PDF -> Store in object store.
Step-by-step implementation:
- Create managed topic and subscription with push or function trigger.
- Deploy the function with retries and idempotency keyed on the invoice ID (see the sketch below).
- Configure DLQ for failed deliveries.
- Monitor function invocations and Pub Sub metrics.
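A sketch of the idempotency step, using Redis SET with NX as the atomic "already processed" check; the key format, TTL, and the processing stubs are assumptions, and any store with an atomic claim works.

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)   # assumed Redis endpoint
DEDUP_TTL_SECONDS = 7 * 24 * 3600              # remember processed IDs for a week (assumption)

def generate_pdf(invoice_id: str) -> None:
    print("rendering PDF for", invoice_id)     # hypothetical processing stub

def send_email(invoice_id: str) -> None:
    print("emailing invoice", invoice_id)      # hypothetical processing stub

def handle_invoice_event(event: dict) -> None:
    invoice_id = event["invoice_id"]
    # SET ... NX succeeds only for the first writer, so redelivered copies of the
    # same event (at-least-once semantics) become no-ops instead of duplicate emails.
    first_time = r.set(f"invoice:processed:{invoice_id}", 1, nx=True, ex=DEDUP_TTL_SECONDS)
    if not first_time:
        return  # duplicate delivery; safely ignore
    generate_pdf(invoice_id)
    send_email(invoice_id)
```

Claiming before processing means a crash mid-task can drop an event; claim after processing instead if a duplicate is preferable to a gap.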
What to measure: Invocation success rate, end-to-end latency, DLQ entries.
Tools to use and why: Managed cloud Pub Sub and serverless functions for low ops overhead.
Common pitfalls: Missing idempotency causing duplicate invoice emails.
Validation: Stress test with burst of invoice events and verify storage.
Outcome: Scalable invoice processing with minimal operational burden.
Scenario #3 — Incident-response notification pipeline (incident-response/postmortem)
Context: SRE team needs immediate alerting for system anomalies plus audit trail.
Goal: Fan-out incident events to multiple listeners reliably.
Why Pub Sub matters here: Single event can notify chat, pager, dashboard, and audit sinks.
Architecture / workflow: Monitoring -> Pub Sub topic -> Subscribers: Pager service, Chat bot, Logging sink -> Ack.
Step-by-step implementation:
- Configure monitoring to publish alerts to topic.
- Implement subscribers with dedupe and rate limiting.
- Route critical alerts to paging and others to tickets.
- Keep audit sink subscribed for compliance.
What to measure: Delivery success to pager, DLQ for failed notifications.
Tools to use and why: Pub Sub to distribute event once; alerting integration for paging.
Common pitfalls: Fan-out loops causing repeated alerts.
Validation: Simulate incident and verify all sinks received one notification.
Outcome: Reliable multi-channel incident notification and audit trail.
Scenario #4 — Cost vs performance batch vs streaming trade-off (cost/performance)
Context: Analytics team needs near-real-time metrics but budget is constrained.
Goal: Balance cost and latency by selecting streaming vs batch ingestion for different datasets.
Why Pub Sub matters here: Enables both low-latency streaming for critical metrics and batched topics for cheaper processing.
Architecture / workflow: Producers tag events as realtime or bulk -> Realtime topic for streaming processors -> Bulk topic aggregated and processed hourly.
Step-by-step implementation:
- Classify event types by latency sensitivity.
- Create topics with different retention and partitioning.
- Route critical events to low-latency stream, others to batch topic.
- Monitor cost per topic and tune retention.
What to measure: Cost per million messages, latency p95, consumer cost.
Tools to use and why: Kafka or managed Pub Sub for streaming; cloud batch tools for bulk.
Common pitfalls: Misclassification causing cost spikes.
Validation: Cost modeling and controlled switch of event types.
Outcome: Controlled costs with preserved critical low-latency paths.
Common Mistakes, Anti-patterns, and Troubleshooting
Common mistakes (Symptom -> Root cause -> Fix):
- Symptom: Growing consumer lag -> Root cause: Slow consumer logic -> Fix: Optimize processing and auto-scale.
- Symptom: Duplicate side-effects -> Root cause: At-least-once without idempotency -> Fix: Implement idempotency keys.
- Symptom: Topic hot partitions -> Root cause: Poor partition key choice -> Fix: Use better partitioning or increase partitions.
- Symptom: Unexpected message loss -> Root cause: Short retention or mis-acks -> Fix: Increase retention and verify ack logic.
- Symptom: High broker CPU -> Root cause: Large messages and compression cost -> Fix: Limit payload size and tune compression.
- Symptom: Deserialization errors -> Root cause: Schema evolution mismatch -> Fix: Use schema registry and compatibility rules.
- Symptom: DLQ filling quickly -> Root cause: Unhandled consumer exceptions -> Fix: Improve error handling and backoff.
- Symptom: Alert storms -> Root cause: Poor alert thresholds and correlation -> Fix: Group alerts and add debouncing.
- Symptom: Cross-region lag -> Root cause: Insufficient replication config -> Fix: Increase bandwidth or tune replication.
- Symptom: Unauthorized access -> Root cause: Misconfigured ACLs -> Fix: Enforce least privilege and rotate keys.
- Symptom: High cost on cloud Pub Sub -> Root cause: Unbounded retention and high egress -> Fix: Tune retention and compress messages.
- Symptom: Rebalance thrashing -> Root cause: Frequent consumer restarts -> Fix: Stabilize consumers and stagger deployments.
- Symptom: Missing business events -> Root cause: Downstream sync call failure in consumer -> Fix: Buffer calls and retry isolated work.
- Symptom: Trace gaps across async boundaries -> Root cause: Missing context propagation -> Fix: Attach trace IDs to messages.
- Symptom: Slow replays -> Root cause: Inefficient consumer reprocessing -> Fix: Parallelize replay and use bulk reads.
- Symptom: No visibility into message contents -> Root cause: Logs forbidden due to PII -> Fix: Mask sensitive fields and store sample messages securely.
- Symptom: No schema governance -> Root cause: Ad-hoc changes by teams -> Fix: Central schema registry and review process.
- Symptom: Fan-out overload of downstream APIs -> Root cause: Synchronous processing in subscribers -> Fix: Introduce throttling and batching.
- Symptom: Large cardinality in metrics -> Root cause: Per-message labels used in metrics -> Fix: Aggregate metrics before emitting.
- Symptom: Slow incident response -> Root cause: Missing runbooks for Pub Sub incidents -> Fix: Create and test runbooks regularly.
Observability-specific pitfalls:
- Symptom: Missing consumer lag metric -> Root cause: Not instrumenting offsets -> Fix: Emit committed offsets and compute lag.
- Symptom: No end-to-end traces -> Root cause: Dropped trace context on publish -> Fix: Include trace IDs in message headers.
- Symptom: Unclear DLQ reasons -> Root cause: No sampling of failed payloads -> Fix: Store sanitized sample messages for debugging.
- Symptom: Alert noise from transient lag -> Root cause: Thresholds too low and no smoothing -> Fix: Use sustained window and burn-rate logic.
- Symptom: High cardinality causing Prometheus OOM -> Root cause: Topic or message IDs in labels -> Fix: Use bounded labels and histograms.
Best Practices & Operating Model
Ownership and on-call:
- Define topic ownership per team.
- Assign on-call for platform-level incidents and another for consumer teams.
- Use clear escalation paths for cross-team incidents.
Runbooks vs playbooks:
- Runbooks: Step-by-step actionable procedures (restart, scale, replay).
- Playbooks: Higher-level strategies and decision guidance for complex incidents.
Safe deployments:
- Canary publishers to test new schema/format.
- Rolling consumer restarts and staggered rollouts to minimize rebalance impact.
- Automated rollback triggered by SLO or error budget breaches.
Toil reduction and automation:
- Automate replay mechanics and bulk validation.
- Auto-scale consumers based on lag and throughput.
- Automate alert remediation for known issues.
Security basics:
- Enforce TLS and mutual auth where supported.
- Use IAM and per-topic ACLs.
- Encrypt at rest and rotate credentials regularly.
- Audit topic access and set quotas for tenants.
Weekly/monthly routines:
- Weekly: Check DLQ patterns, consumer restart trends.
- Monthly: Review retention costs, partition skew, schema changes.
- Quarterly: Capacity planning and disaster recovery drills.
What to review in postmortems:
- Root cause and whether Pub Sub configuration contributed.
- Message loss or duplication analysis.
- SLO breaches and error budget consumption.
- Any schema or consumer changes that caused the issue.
- Action items to prevent recurrence.
Tooling & Integration Map for Pub Sub
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Broker | Stores and delivers messages | Producers, consumers, schema registry | Core of any Pub Sub system |
| I2 | Managed Pub Sub | Cloud-hosted broker service | Cloud IAM, monitoring, functions | Low ops but vendor-specific |
| I3 | Schema registry | Central schema storage | Producers, consumers, CI | Enforces compatibility |
| I4 | Connectors | Move data between systems | Databases, object stores, analytics | Speeds ETL development |
| I5 | Monitoring | Collects metrics and alerts | Prometheus, cloud metrics, tracing | Observability foundation |
| I6 | Tracing | End-to-end request tracing | OpenTelemetry, APMs | Critical for async context |
| I7 | DLQ processor | Handles failed messages | Worker systems, storage | Automates retry and inspection |
| I8 | Security/Audit | IAM, encryption, audit logs | SIEM, identity providers | Compliance and access control |
| I9 | Autoscaling | Scales consumers/brokers | KEDA, cluster autoscaler | Responds to lag and load |
| I10 | Cost tools | Tracks messaging cost | Billing APIs, tags | Controls spend and optimization |
Frequently Asked Questions (FAQs)
What is the difference between Pub Sub and a message queue?
Pub Sub is fan-out and topic-based, while a message queue is typically point-to-point. Pub Sub scales for many subscribers; queues distribute among consumers.
Can Pub Sub guarantee message ordering?
Some systems guarantee ordering per partition or key. Global ordering across all producers is generally not guaranteed.
Is Pub Sub suitable for transactional systems?
Not for strong ACID transactions across services; use compensating transactions or event-driven sagas when needed.
How do I avoid duplicate processing?
Design idempotent consumers or use deduplication IDs and store processed IDs atomically.
What retention should I set for topics?
Depends on recovery needs; balance cost and replay requirements. No universal value.
How to choose partition keys?
Choose keys aligned with ordering and load distribution, avoid high-cardinality hotspots.
What are common delivery semantics?
At-most-once, at-least-once, and exactly-once depending on broker and configuration.
How do I debug missing messages?
Check retention settings, consumer offsets, broker logs, and DLQs to trace the message lifecycle.
How to secure Pub Sub?
Use TLS, IAM/ACLs, encryption at rest, and audit logging. Rotate keys and enforce least privilege.
How to handle schema evolution?
Use a schema registry and enforce backward/forward compatibility before rolling changes.
Can serverless functions scale with Pub Sub?
Yes, managed cloud Pub Sub systems integrate function triggers and scale automatically with events.
How do I measure end-to-end latency?
Capture timestamps at publish and at consumer ack and compute percentile latencies, accounting for clock skew.
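A sketch of that measurement: stamp each message at publish time, compute the delta at ack, and report percentiles. Field names are assumptions, and cross-host clock skew still needs to be bounded (e.g., via NTP) or corrected.

```python
import statistics
import time

latencies_ms = []

def on_publish(payload: dict) -> dict:
    return {"published_at": time.time(), **payload}   # stamp at the producer

def on_ack(message: dict) -> None:
    latencies_ms.append((time.time() - message["published_at"]) * 1000.0)

# Simulate a broker + consumer hop to collect samples, then report p95:
for i in range(100):
    msg = on_publish({"id": i})
    time.sleep(0.002)            # stand-in for delivery and processing time
    on_ack(msg)

p95 = statistics.quantiles(latencies_ms, n=20)[18]    # 19th of 20 cut points = p95
print(f"p95 end-to-end latency: {p95:.1f} ms")
```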
Is Pub Sub cost-effective for low-volume systems?
Managed Pub Sub may be reasonable; evaluate per-message costs and retention charges.
How to deal with noisy neighbors in multitenant topics?
Use quotas, partitions, or separate topics per tenant and throttle aggressive producers.
What causes consumer lag?
Slow processing, insufficient consumer count, or sudden traffic spike; monitor and auto-scale.
Should I log full message payloads?
Avoid logging sensitive data; store sanitized or sampled payloads for debugging.
When to use DLQ vs retry?
Retries for transient consumer errors; DLQs for poison messages after retry exhaustion.
How to test Pub Sub changes safely?
Use canary topics, staging environments, and consumer contract tests before production rollout.
Conclusion
Pub Sub is a foundational pattern for modern distributed systems. It enables decoupling, scalability, and resilience when implemented with observability, security, and SLO discipline. The operational trade-offs require solid engineering practices and an SRE mindset.
Next 7 days plan:
- Day 1: Inventory topics and owners; map critical paths.
- Day 2: Ensure basic metrics and consumer lag monitoring are enabled.
- Day 3: Register schemas and define compatibility rules for critical topics.
- Day 4: Create runbooks for DLQ triage and consumer lag incidents.
- Day 5: Run a load test for one critical topic and validate autoscaling.
Appendix — Pub Sub Keyword Cluster (SEO)
- Primary keywords
- pub sub
- publish subscribe
- pub/sub architecture
- message broker
- pub sub tutorial
- event-driven architecture
- pub sub system
- pub sub pattern
- cloud pub sub
- pub sub 2026
- Secondary keywords
- at least once delivery
- exactly once semantics
- message retention
- topic partitioning
- consumer lag
- dead letter queue
- schema registry
- event streaming
- stream processing
- serverless event triggers
- Long-tail questions
- what is pub sub and how does it work
- when to use pub sub vs queues
- how to measure pub sub latency
- pub sub best practices for security
- how to design partition keys for pub sub
- pub sub monitoring and SLO examples
- pub sub error handling and DLQ strategies
- how to implement idempotency for pub sub consumers
- pub sub architecture patterns for microservices
- how to scale pub sub consumers in kubernetes
- Related terminology
- topic
- partition
- offset
- consumer group
- broker cluster
- producer
- subscriber
- ack nack
- replication factor
- compacted topic
- fan-out
- backpressure
- rate limiting
- schema evolution
- CDC
- Kafka
- Pulsar
- NATS
- MQTT
- event bus
- connector
- tracing
- OpenTelemetry
- Prometheus
- Grafana
- DLQ
- retention policy
- leader election
- autoscaling
- KEDA
- managed pub sub
- cloud event
- event sourcing
- idempotency
- checksum
- payload size
- partition skew
- geo replication
- multitenancy
- compliance logging