Quick Definition
A Network load balancer is a high-performance Layer 4 traffic distributor that forwards TCP and UDP connections across backend targets to provide scalability and fault tolerance. Analogy: it is a traffic director at a highway interchange that routes vehicles to available lanes. Formal: a packet-level proxy or selector that balances transport layer flows using connection hashing, health checks, and routing rules.
What is Network load balancer?
A Network load balancer (NLB) is an infrastructure component that accepts incoming network connections at the transport layer and distributes them across multiple backend endpoints. It focuses on high throughput, low latency, and preserving client IP where needed. It is not an application-aware HTTP reverse proxy; it does not inspect application payloads or make routing decisions based on HTTP headers or content.
Key properties and constraints
- Operates at Layer 4 (TCP/UDP/QUIC variants).
- Preserves high packet throughput and low connection latency.
- Usually supports flow hashing by connection 5-tuple or by source/destination IP.
- Health checks are typically TCP or UDP probes.
- Limited or no L7 features like header manipulation, content routing, or WAF.
- May support static IPs or Elastic IPs for consistent endpoints.
- Often implements connection tracking and session affinity at transport level.
- Can be single-tenant or multi-tenant in managed clouds; performance and limits vary.
Where it fits in modern cloud/SRE workflows
- As the primary edge for non-HTTP workloads such as databases, MQTT, game servers, VoIP, and raw TCP APIs.
- In front of proxies, service clusters, or autoscaling pools that require predictable latency.
- Paired with L7 ingress proxies in layered architectures where L4 terminates TLS and L7 handles HTTP routing.
- Integrated in Kubernetes via Service type LoadBalancer or using external NLB controllers.
- Used in hybrid-cloud and inter-region connectivity as stable, static endpoints.
Diagram description (text-only)
- Client traffic arrives at the NLB's static VIP.
- The NLB accepts the connection at the transport layer.
- The NLB consults its routing table and health state.
- The NLB forwards or DNATs the connection to the chosen backend instance or pod.
- The backend responds; the NLB either performs SNAT or preserves the source IP and forwards the response back to the client.
- Health checks run on a separate path, marking targets healthy or unhealthy.
Network load balancer in one sentence
A high-performance Layer 4 distributor that routes TCP/UDP connections to backend targets with low latency and high throughput while providing health-based availability.
Network load balancer vs related terms
| ID | Term | How it differs from Network load balancer | Common confusion |
|---|---|---|---|
| T1 | Application load balancer | Operates at Layer 7 and understands HTTP semantics | Confused because both provide load balancing |
| T2 | Reverse proxy | Often L7 with request rewriting unlike NLB | People expect header manipulation from NLB |
| T3 | TCP proxy | Similar, but may terminate TLS and inspect streams | The term is sometimes used interchangeably with NLB |
| T4 | IPVS | Kernel-level L4 scheduler used by Kubernetes | IPVS is an implementation not a managed service |
| T5 | DNS load balancing | Uses DNS to distribute rather than per-connection routing | DNS changes are not instantaneous |
| T6 | Anycast routing | Network routing at BGP level vs NLB handles connections | Both provide global routing but different layers |
| T7 | Service mesh | L7 service-to-service features, sidecar based | Service mesh is application side vs NLB infra side |
| T8 | NAT gateway | Focused on outbound IP translation vs NLB inbound balancing | NAT does not balance inbound traffic |
| T9 | CDN | Caches content at edge, not per-connection L4 balancing | CDN is content delivery, not raw TCP routing |
| T10 | Stateful firewall | Focused on security policies not load distribution | Firewalls may inspect traffic but not balance |
Why does Network load balancer matter?
Business impact
- Revenue: Low-latency reliable connectivity for customer-facing real-time services directly impacts revenue for gaming, finance, or streaming products.
- Trust: Consistent IP endpoints and quick failover reduce customer-visible outages and improve SLAs.
- Risk: Misconfiguration or insufficient capacity at L4 can create single points of failure that cascade into large outages.
Engineering impact
- Incident reduction: Properly configured NLBs reduce target-level overload through health checks and connection draining.
- Velocity: Teams can deploy service replicas and autoscale without changing client endpoints.
- Complexity tradeoffs: Moving routing to L4 removes many application-level routing decisions but requires robust observability.
SRE framing
- SLIs/SLOs: Use availability and connection success SLIs; include latency percentiles for accept and backend connect times.
- Error budgets: Allocate budget for backend failures separate from NLB downtime; track client visible failures.
- Toil/on-call: Automate health recovery and capacity scaling; eliminate manual reconfiguration using IaC.
- On-call: Runbooks should include NLB health, target group checks, and emergency reroutes.
What breaks in production (realistic examples)
- Health-check misconfiguration marks all backends unhealthy after a deploy, causing traffic blackhole.
- Exhausted connection tracking table on NLB leads to new connection failures while existing sessions continue.
- Incorrectly sized connection timeouts cause mid-session disconnects for long-lived TCP sessions.
- Network ACL or security group rule disallows backend health probes, leaving NLB unaware of true backend state.
- Autoscaling rate-lags create a thundering herd on new instances causing connection resets.
Where is Network load balancer used?
| ID | Layer/Area | How Network load balancer appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge network | VIP or static IP accepting client TCP/UDP | Connections per second and accept/drop counts | Cloud NLBs and on-prem appliances |
| L2 | Service mesh border | North-south L4 gateway before mesh | Backend connect latency and failure rate | Ingress controllers and NLB integrations |
| L3 | Kubernetes | Service type LoadBalancer or NLB controller | Pod backend health and session affinity | Service controllers and cloud LB services |
| L4 | Serverless PaaS | External TCP endpoints for managed services | Invocation connection counts and errors | Managed platform load balancers |
| L5 | Hybrid connectivity | On prem to cloud VPN exit points | Throughput and packet loss | Transit gateways and NLBs |
| L6 | CI CD pipelines | Integration tests hitting NLB endpoints | Test connection success and latencies | Test harness and synthetic monitors |
| L7 | Observability plane | Telemetry collectors fronted by NLB | Metrics ingestion rates and dropped packets | Metrics endpoints and collectors |
| L8 | Security perimeter | DDoS protection paired with NLB | Attack traffic patterns and anomalies | DDoS scrubbing and NLB shields |
When should you use Network load balancer?
When it’s necessary
- Non-HTTP protocols (TCP, UDP, QUIC) require L4 distribution.
- Low per-connection latency is critical (real-time trading, gaming).
- Need preserved client IP at backend for policy or logging.
- High connection throughput with minimal proxying overhead.
When it’s optional
- Simple HTTP services where L7 features are desired; consider ALB or reverse proxy.
- Small internal services with low traffic; simple DNS or cluster IP may suffice.
When NOT to use / overuse it
- For complex HTTP routing, rate limiting, or header-based auth—use L7 proxies or API gateways.
- For web caching or CDN needs.
- When trying to implement application logic (stick to app layer).
Decision checklist
- If low-latency TCP/UDP and client IP preserve needed -> Use NLB.
- If application needs header-based routing, WAF, or content rewrite -> Use ALB or reverse proxy.
- If global failover and geo-routing required at BGP level -> Consider anycast plus NLBs regionally.
Maturity ladder
- Beginner: Single NLB, static target group, basic health checks.
- Intermediate: Autoscaling integrated, connection draining, structured health checks, alerting.
- Advanced: Multi-region active-passive or active-active with automated failover, telemetry-driven autoscaling, chaos tests, and intelligent routing.
How does Network load balancer work?
Components and workflow
- Listener: Accepts connections on a VIP IP and port and selects protocol handler.
- Target pool / target group: The set of backend endpoints (VMs, containers, IPs) registered.
- Connection tracker: Tracks connections and state for affinity and lifecycle.
- Health checker: Probes backends to mark healthy/unhealthy.
- Scheduler: Chooses target per connection using algorithms (round robin, hash, least connections).
- NAT or proxy component: Optionally performs DNAT/SNAT or maintains transparent forwarding.
- Control plane: API to manage rules, targets, and monitoring.
- Data plane: Packet forwarding; optimized for kernel bypass or hardware acceleration.
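The scheduler's flow-hashing step can be sketched in Python. This is a didactic illustration of per-flow hashing over healthy targets, not any provider's implementation; the backend list and field names are invented for the example.

```python
import hashlib

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto, backends):
    """Map a flow 5-tuple deterministically onto one healthy backend.

    Hashing the full tuple gives flow stickiness: every packet of the
    same connection reaches the same backend without per-flow state.
    """
    healthy = [b for b in backends if b["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy backends: traffic would blackhole")
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return healthy[digest % len(healthy)]

backends = [
    {"addr": "10.0.1.10", "healthy": True},
    {"addr": "10.0.1.11", "healthy": True},
    {"addr": "10.0.1.12", "healthy": False},  # drained by the health checker
]

# The same 5-tuple always selects the same backend:
first = pick_backend("203.0.113.7", 51000, "198.51.100.1", 443, "tcp", backends)
second = pick_backend("203.0.113.7", 51000, "198.51.100.1", 443, "tcp", backends)
```

Note that real data planes typically use consistent hashing so that adding or removing a backend remaps only a fraction of flows; plain modulo selection, as here, remaps most of them.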
Data flow and lifecycle
- Client initiates TCP/UDP to NLB VIP.
- NLB listener accepts connection and consults scheduler and health table.
- NLB selects a backend and either DNATs packet destination to backend or proxies it.
- Backend responds; NLB forwards response back to client preserving connection properties.
- Health checker probes backends periodically; unhealthy ones are drained and removed.
- Connection draining waits for long-lived sessions to finish or migrates based on capability.
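The health-checking step of this lifecycle can be sketched as a TCP probe plus consecutive-result thresholds. This is an illustrative sketch under assumed defaults (3 failures to mark unhealthy, 2 passes to recover), not a cloud provider's probe logic.

```python
import socket

def tcp_probe(host, port, timeout=2.0):
    """One TCP health probe: success means the 3-way handshake completed.
    Note: this says nothing about application-level health (an L4 blind spot)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class HealthState:
    """Flip health only after N consecutive probe results, damping flapping."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.healthy = True
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self._fails = 0
        self._passes = 0

    def record(self, ok):
        if ok:
            self._passes, self._fails = self._passes + 1, 0
            if not self.healthy and self._passes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._fails, self._passes = self._fails + 1, 0
            if self.healthy and self._fails >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy

state = HealthState()
for ok in (True, False, False, False):  # three consecutive failures
    state.record(ok)
```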
Edge cases and failure modes
- A backend that completes the TCP handshake (returns SYN-ACK) but fails application-level processing is invisible to L4 health checks.
- Half-open connections due to abrupt backend process exit can exhaust NLB tracking.
- Asymmetric routing if backend responses bypass NLB causes source IP mismatch.
- Large client pools may hit connection table limits leading to new connection rejections.
Typical architecture patterns for Network load balancer
- L4 edge with L7 reverse proxy behind it — Use when TLS termination needs to be offloaded or load distributed across proxies.
- NLB directly to stateful services — Use for databases, caches, or persistent TCP services requiring low latency.
- NLB in front of Kubernetes NodePorts — Useful when preserving client IP and using native cloud LBs.
- Multi-region active-passive with DNS failover — For global services that need regional isolation and controlled failover.
- Anycast VIP with local NLBs — Use when global single IP entry and local routing are required for low-latency global services.
- Hybrid architecture bridging on-prem and cloud backends — NLB as stable ingress for hybrid workloads.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Health check flapping | Targets frequently marked unhealthy | Incorrect probe config or intermittent service | Fix probe path and increase thresholds | Health check success rate |
| F2 | Connection table exhaustion | New connections fail with timeouts | Too many concurrent connections | Increase limits or scale out | Connection tracker usage |
| F3 | Asymmetric routing | Responses bypass NLB and client drops | Backend SNAT or routing misconfig | Ensure symmetric paths or force SNAT | Packet drops and RTT variance |
| F4 | Long-lived session storm | Resource depletion and slow accept | Many persistent sessions not expired | Connection draining and idle timeouts | Session duration percentiles |
| F5 | Misrouted traffic | Connections hit wrong backend | Incorrect target registration | Verify target metadata and routing rules | Backend connection counts |
| F6 | DDoS pressure | Elevated packet rate and dropped connections | Layer4 attack or volumetric spike | Enable scrubbing and rate limit | Unusual PPS and byte rates |
| F7 | TLS handshake errors | Clients fail TLS during connect | TLS passthrough mismatch or SNI missing | Match TLS config or offload appropriately | TLS handshake failure metric |
Key Concepts, Keywords & Terminology for Network load balancer
Glossary of essential terms
- Listener — Accepts connections on a VIP and port — Primary entrypoint.
- VIP — Virtual IP address assigned to NLB — Stable endpoint for clients.
- Target group — Set of backend endpoints — Logical grouping for routing.
- Health check — Probe to determine backend health — Ensures availability.
- Connection tracking — State store for flows — Required for affinity and long sessions.
- DNAT — Destination NAT used to forward packets — Preserves backend addressing.
- SNAT — Source NAT used to rewrite client IP — Used when backend needs stable source.
- Session affinity — Keeps connections to same backend — Useful for stateful services.
- TCP probe — Health check using TCP handshake — Fast but not application-aware.
- UDP balancing — Handling of connectionless traffic — Requires special handling for sessions.
- QUIC support — UDP-based transport; may need unique handling — Emerging requirement.
- Idle timeout — Time to close inactive connections — Prevents tracker leakage.
- Connection draining — Graceful removal of backends — Prevents dropped sessions.
- Scheduling algorithm — Round robin, hash, least connections — Determines target selection.
- Source IP preserve — Keeps original client IP visible to backend — Important for logging and ACLs.
- Proxy mode — NLB proxies traffic rather than DNAT — Changes path and performance.
- Anycast — Same IP announced from multiple regions — For global low-latency routing.
- BGP — Routing protocol used in anycast or direct peering — For wide-area routing.
- Throughput — Bytes per second capacity — Sizing metric.
- PPS — Packets per second — Important for small-packet workloads.
- TLS passthrough — L4 forwarding of encrypted traffic — Backend handles TLS.
- TLS termination — NLB or edge terminates TLS — Enables L7 inspection.
- Sticky sessions — Another term for session affinity — Useful for stateful apps.
- Connection multiplexing — Multiple logical connections over one physical connection — Reduces load.
- Flow hashing — Deterministic mapping based on tuple — Ensures session stickiness.
- Load balancing algorithm — Decision logic for selection — Impacts distribution fairness.
- Control plane — API and management layer — Used for config and monitoring.
- Data plane — High-speed forwarding layer — Handles packets with low latency.
- Kernel bypass — Data plane optimization to avoid OS kernel — For performance.
- Hardware offload — Using NIC or ASIC features — For high throughput/low latency.
- Autoscaling integration — NLB reacts to autoscaled targets — Essential for dynamic workloads.
- Cross-zone balancing — Distributes load across zones — Reduces hotspot risk.
- Failover — Automated traffic switching on failure — Improves availability.
- Global load balancing — Multi-region traffic distribution — For resiliency and latency.
- Circuit breaking — Prevent overload by cutting traffic — Protects systems.
- Rate limiting — Controls traffic rate — DDoS and abuse mitigation.
- Observability — Metrics, logs, traces — For troubleshooting.
- Synthetic checks — External probes to ensure end-to-end connectivity — Complements health checks.
- Blackhole — State where traffic is accepted by NLB but not serviced — Critical outage mode.
- Chaos testing — Controlled disruption to validate resilience — Reduces surprise failures.
- Idle connection storm — Many idle but open sessions — Can exhaust resources.
- Flow stickiness — Consistency of client to backend mapping — Important for TCP sessions.
- Multi-protocol support — Ability to handle TCP UDP and sometimes ICMP — Broadens use.
- TLS session reuse — Performance optimization for TLS handshakes — Reduces CPU.
- Observability sampling — Reducing telemetry volume while preserving signal — Cost control.
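Two of the glossary behaviors above — connection tracking and idle timeout — can be illustrated with a toy flow table. This is a didactic sketch, not a production data structure; the capacity and timeout values are arbitrary.

```python
import time

class ConnectionTracker:
    """Toy flow table showing connection tracking plus idle-timeout eviction."""

    def __init__(self, capacity, idle_timeout_s):
        self.capacity = capacity
        self.idle_timeout_s = idle_timeout_s
        self._flows = {}  # flow key -> last-seen timestamp

    def touch(self, flow, now=None):
        """Record traffic for a flow; returns False if a new flow is rejected
        because the table is full (the 'connection table exhaustion' failure)."""
        now = time.monotonic() if now is None else now
        if flow not in self._flows:
            self._evict_idle(now)
            if len(self._flows) >= self.capacity:
                return False
        self._flows[flow] = now
        return True

    def _evict_idle(self, now):
        cutoff = now - self.idle_timeout_s
        for f in [f for f, seen in self._flows.items() if seen < cutoff]:
            del self._flows[f]

    def utilization(self):
        return len(self._flows) / self.capacity

tracker = ConnectionTracker(capacity=2, idle_timeout_s=60)
tracker.touch(("client-a", 51000), now=0.0)
tracker.touch(("client-b", 51001), now=1.0)
rejected = tracker.touch(("client-c", 51002), now=2.0)    # table full
accepted = tracker.touch(("client-c", 51002), now=120.0)  # idle flows evicted
```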
How to Measure Network load balancer (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Connection success rate | Percent of client connects that establish | Successful accepts divided by attempts | 99.95% | Counts depend on client retries |
| M2 | Backend connect latency | Time to establish backend TCP | Measure time from accept to backend SYN-ACK | p50 under 10ms p95 under 100ms | Includes network path to backend |
| M3 | Request latency at L4 | Round-trip time from SYN to first acknowledged data | Packet timestamps at NLB | p95 under 200ms | Long-lived sessions skew averages |
| M4 | Connection table utilization | How full tracker is | Active connections divided by capacity | <75% | Hard limits differ by provider |
| M5 | Health check success rate | Backend probe pass ratio | Successful probes / total | 99.9% | Healthy probes may not indicate app health |
| M6 | Packet drop rate | Packets dropped in data plane | Dropped packets / received | <0.01% | Bursts can mask sustained issues |
| M7 | PPS and throughput | Capacity utilization | Measured at the NLB, e.g. packets per second and bytes per second | Below 80% capacity | Small-packet workloads increase PPS |
| M8 | TLS handshake failures | TLS session errors at L4 passthrough | TLS failures / total handshakes | <0.1% | Misconfig causes spikes |
| M9 | Connection draining time | Time to complete draining | Time from mark unhealthy to zero active | Configured max wait | Long sessions may block deployments |
| M10 | Autoscale reaction time | Time for backend pool to scale | Time from metric threshold to new healthy instance | <3min for cloud autoscale | Cold start times vary |
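The M1 and M4 SLIs in the table reduce to simple ratios; a minimal Python sketch (the counts below are invented for illustration):

```python
def connection_success_rate(successful_accepts, attempts):
    """M1: fraction of client connection attempts that established."""
    return successful_accepts / attempts if attempts else 1.0

def connection_table_utilization(active_connections, capacity):
    """M4: fraction of the connection tracker in use; alert well below 1.0."""
    return active_connections / capacity

# Invented counts for one measurement window:
sli = connection_success_rate(999_620, 1_000_000)        # above the 99.95% M1 target
util = connection_table_utilization(610_000, 1_000_000)  # under the 75% M4 ceiling
```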
Best tools to measure Network load balancer
Tool — Prometheus
- What it measures for Network load balancer: Metrics like connection counts, health checks, and throughput via exporters.
- Best-fit environment: Kubernetes and cloud-native environments.
- Setup outline:
- Run exporters or use cloud provider metrics exporter.
- Scrape NLB and backend target metrics.
- Configure recording rules for SLIs.
- Aggregate and store in long-term storage if needed.
- Strengths:
- Flexible query language and alerting.
- Rich ecosystem of exporters.
- Limitations:
- Requires management for scale.
- High cardinality telemetry can be costly.
Tool — Grafana Cloud or Grafana OSS
- What it measures for Network load balancer: Visualizes Prometheus and other metrics for dashboards.
- Best-fit environment: Teams needing visual dashboards across infra.
- Setup outline:
- Connect Prometheus or cloud metrics source.
- Import templates for NLB dashboards.
- Create on-call views for alerts.
- Strengths:
- Flexible panels and annotations.
- Multi-datasource support.
- Limitations:
- Dashboards need maintenance.
- Alerting maturity depends on backend.
Tool — Cloud provider metrics (managed)
- What it measures for Network load balancer: Native metrics like connection attempts, healthy hosts, and bytes.
- Best-fit environment: Cloud-managed NLBs.
- Setup outline:
- Enable metrics collection.
- Tag NLB resources.
- Hook to alerting and dashboards.
- Strengths:
- Highly accurate and near real-time.
- No agent required.
- Limitations:
- Metric granularity or retention varies.
- Proprietary metric names require adaptation.
Tool — eBPF observability
- What it measures for Network load balancer: Packet-level tracing, latency breakdowns and connection tracking anomalies.
- Best-fit environment: On-prem or cloud instances where kernel-level visibility is allowed.
- Setup outline:
- Deploy eBPF probe tools.
- Capture connection lifecycle traces.
- Aggregate traces to correlate with metrics.
- Strengths:
- Low overhead and deep visibility.
- Can discover asymmetric routing.
- Limitations:
- Requires kernel support and privileges.
- Complexity in interpreting traces.
Tool — Synthetic monitors
- What it measures for Network load balancer: End-to-end connectivity and connection success from diverse vantage points.
- Best-fit environment: Public-facing services and multi-region setups.
- Setup outline:
- Schedule TCP/UDP probes from edge locations.
- Measure connect times and error rates.
- Alert on failures and latency spikes.
- Strengths:
- Real user perspective checks.
- Simple to interpret.
- Limitations:
- Additional cost and potential blind spots for internal-only paths.
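A minimal synthetic monitor of the kind described above could look like the Python sketch below. Vantage-point scheduling and alerting are out of scope, and the sample values are placeholders.

```python
import socket
import statistics
import time

def timed_connect(host, port, timeout=3.0):
    """One synthetic probe: TCP connect time in milliseconds, None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None

def probe_summary(samples):
    """Reduce a batch of probe results to a success rate and p95 latency."""
    ok = [s for s in samples if s is not None]
    success_rate = len(ok) / len(samples) if samples else 0.0
    if len(ok) >= 2:
        p95 = statistics.quantiles(ok, n=20)[-1]
    else:
        p95 = ok[0] if ok else None
    return success_rate, p95

# Invented results from one probe cycle (one timeout among five attempts):
rate, p95 = probe_summary([12.0, 15.0, None, 14.0, 13.0])
```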
Recommended dashboards & alerts for Network load balancer
Executive dashboard
- Panels:
- Overall availability SLI and burn rate: shows global success rate and trend.
- Throughput and PPS aggregated by region: capacity snapshot.
- Top impacted services and error budget remaining: business impact view.
- Why: Give product and leadership a clear availability and capacity snapshot.
On-call dashboard
- Panels:
- Real-time connection success rate per NLB: primary incident indicator.
- Health check failures by target group: fast isolate failing pools.
- Connection table usage and resource saturation: capacity alarms.
- Recent configuration changes and deployments: cause correlation.
- Why: Fast triage and route to affected teams.
Debug dashboard
- Panels:
- Per-backend connect latency distribution: isolate slow nodes.
- Packet drop rates and retransmit counts: network issues.
- Synthetic probe results and geographic breakdown: external visibility.
- Health check probe logs and recent events: root cause clues.
- Why: Deep dive metrics for engineers during incidents.
Alerting guidance
- Page vs ticket:
- Page on global connection success drops below SLO or high connection table usage >85% sustained.
- Ticket for non-urgent degradations like intermittent small increases in latency under SLO.
- Burn-rate guidance:
- If error budget burn rate exceeds 2x sustained over 1 hour, escalate to on-call review.
- Noise reduction tactics:
- Group alerts per NLB or target group.
- Deduplicate alerts across zones if upstream symptom matches.
- Suppress alerts during known maintenance windows.
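The burn-rate guidance above is a simple ratio of the observed error rate to the budgeted error rate; a sketch with invented counts:

```python
def burn_rate(errors, total, slo_target):
    """Error-budget burn rate: observed error rate divided by the budgeted
    error rate. 1.0 means spending budget exactly on schedule."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (errors / total) / budget

def should_escalate(errors, total, slo_target=0.9995, threshold=2.0):
    """Escalate when the burn rate exceeds the 2x guidance, sustained."""
    return burn_rate(errors, total, slo_target) > threshold

# 0.2% failed connections against a 99.95% SLO burns budget at ~4x:
rate = burn_rate(2_000, 1_000_000, 0.9995)
```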
Implementation Guide (Step-by-step)
1) Prerequisites
- Inventory services requiring L4 balancing.
- Define SLIs and SLOs for availability and latency.
- Provision IAM, security groups, and network routes.
- Choose target types (IP, VM, pod) and define health checks.
2) Instrumentation plan
- Export NLB metrics and backend metrics.
- Trace the connection lifecycle where possible.
- Tag resources with service and team ownership.
3) Data collection
- Configure cloud provider metrics and Prometheus exporters.
- Enable synthetic tests and logging for NLB events.
- Centralize logs into structured storage.
4) SLO design
- Set an availability SLO for NLB-bound services and allocate error budgets.
- Define latency SLOs for connection establishment and backend connect.
5) Dashboards
- Build executive, on-call, and debug dashboards as above.
- Include deployment and config change panels.
6) Alerts & routing
- Implement alert rules for SLIs with thresholds and burn-rate monitoring.
- Route alerts to the appropriate on-call teams per service ownership.
7) Runbooks & automation
- Create runbooks for common failures: health check misconfig, scaling, DDoS.
- Automate routine remediation: autoscale, failover, and target re-registration.
8) Validation (load/chaos/game days)
- Run synthetic load tests and scale tests.
- Perform chaos experiments that simulate target failures, high connection rates, and asymmetric routing.
- Validate failover and drain behavior.
9) Continuous improvement
- Review incidents and update SLOs.
- Regularly review capacity and plan scaling adjustments.
- Automate observability and remediation steps.
Checklists
- Pre-production checklist:
  - Health checks validated across zones.
  - Synthetic probes configured from external vantage points.
  - Instrumentation in place and dashboards loaded.
  - IaC templates for NLB created and code-reviewed.
- Production readiness checklist:
  - Autoscale policies tested under load.
  - Connection limits validated and headroom available.
  - On-call runbooks accessible and exercised.
  - DDoS protections and rate limits enabled.
- Incident checklist specific to Network load balancer:
  - Confirm NLB control plane health and recent config changes.
  - Check health check metrics and target statuses.
  - Inspect connection table utilization and timeouts.
  - Validate network ACLs and firewall rules for probe paths.
  - Decide emergency mitigation: increase capacity, reroute traffic, or rate-limit.
Use Cases of Network load balancer
- Real-time gaming servers
  - Context: Fast interactive TCP/UDP sessions with low latency.
  - Problem: High concurrency and low tolerance for added latency.
  - Why NLB helps: Low data-plane overhead and support for UDP/QUIC.
  - What to measure: Connection success, p95 connect latency, PPS.
  - Typical tools: Cloud NLB, eBPF for debugging.
- Database proxies
  - Context: Fronting read replicas or proxy pools for DB traffic.
  - Problem: Need to distribute DB TCP connections reliably.
  - Why NLB helps: Stable IP and low-latency routing.
  - What to measure: Backend connect latency, connection table utilization.
  - Typical tools: Managed NLB, connection poolers.
- MQTT and IoT brokers
  - Context: Millions of devices maintaining TCP sessions.
  - Problem: Massive scale of long-lived connections.
  - Why NLB helps: Efficient tracking and draining.
  - What to measure: Active sessions, idle timeouts, health checks.
  - Typical tools: NLBs with large connection table capacity.
- VoIP and SIP services
  - Context: UDP and TCP transport with jitter sensitivity.
  - Problem: Packet loss and asymmetric routing hurt call quality.
  - Why NLB helps: Packet forwarding optimized for low latency.
  - What to measure: Packet drop rate, jitter, RTT.
  - Typical tools: NLB and RTP-aware SBCs downstream.
- Non-HTTP APIs (gRPC over TCP)
  - Context: gRPC services using transport-level connections.
  - Problem: High-throughput RPCs needing low overhead.
  - Why NLB helps: Transparent routing and preservation of performance.
  - What to measure: RPC failure rates and connect latency.
  - Typical tools: NLB plus L7 proxies for advanced routing if needed.
- Backend telemetry ingestion
  - Context: Metrics or logs collectors accepting TCP streams.
  - Problem: High ingestion throughput and bursty traffic.
  - Why NLB helps: High throughput and simple target scaling.
  - What to measure: Bytes per second, ingestion success.
  - Typical tools: NLB with autoscaled collector pools.
- Hybrid cloud ingress
  - Context: Stable endpoints bridging on-prem to cloud services.
  - Problem: Need consistent IPs and reliable routing.
  - Why NLB helps: Static VIPs and health-aware routing to sites.
  - What to measure: Cross-site latency and failure rates.
  - Typical tools: NLB and transit gateways.
- Managed PaaS TCP endpoints
  - Context: Platforms exposing custom TCP services.
  - Problem: Customers rely on stable IPs and failover.
  - Why NLB helps: Offloads distribution and provides static endpoints.
  - What to measure: Customer connection success and latencies.
  - Typical tools: Cloud-managed NLB services.
- Multiplayer matchmaker
  - Context: Matching players and routing traffic to game servers.
  - Problem: Dynamic target sets and need for affinity.
  - Why NLB helps: Supports hashing and affinity for consistent routing.
  - What to measure: Match setup latency, backend connect times.
  - Typical tools: NLB and ephemeral backend pools.
- High-frequency trading gateways
  - Context: Extremely low-latency TCP connections with determinism.
  - Problem: Even small jitter causes financial loss.
  - Why NLB helps: Hardware offload and kernel bypass for ultra-low latency.
  - What to measure: p99 connect latency, jitter.
  - Typical tools: Specialized NLB appliances or cloud offerings with NIC offload.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes service for TCP game servers (Kubernetes scenario)
Context: A game company runs Kubernetes clusters where pods host game servers communicating over UDP/TCP.
Goal: Provide stable external endpoints and low-latency routing across multiple zones while preserving player IPs.
Why Network load balancer matters here: NLB offers high throughput and can forward UDP, preserve client IPs, and integrate with Kubernetes Service type LoadBalancer.
Architecture / workflow: Player -> Cloud NLB VIP -> NodePort on nodes -> kube-proxy or NLB controller -> Pod game server.
Step-by-step implementation:
- Configure Service type LoadBalancer with externalTrafficPolicy Local to preserve source IP.
- Use an NLB controller that registers pod IPs or nodeports depending on model.
- Set UDP/TCP health checks to the game server listen ports.
- Enable connection draining and set idle timeouts to accommodate gameplay.
- Instrument Prometheus for connection and pod metrics.
What to measure: Active sessions per pod, connect success rate, p95 connect latency, packet loss.
Tools to use and why: Cloud NLB, kube-service-controller, Prometheus, Grafana for dashboards.
Common pitfalls: Misconfigured service mode causing SNAT and lost client IPs; health checks failing due to UDP semantics.
Validation: Run regional load tests with realistic player session durations and verify preservation of player IPs and p95 latencies.
Outcome: Scalable and low-latency ingress for multiplayer sessions.
Scenario #2 — Serverless TCP endpoint for managed PaaS (Serverless/managed-PaaS scenario)
Context: A managed PaaS exposes a custom TCP service that triggers serverless functions.
Goal: Ensure a stable IP and scale to bursts of traffic.
Why Network load balancer matters here: NLB provides static endpoints and scales to accept bursts before serverless invocations cold start.
Architecture / workflow: Client -> NLB -> TCP to managed connector -> invoke serverless worker.
Step-by-step implementation:
- Provision NLB VIP and target pointing to managed connector service.
- Configure health checks and slow start for new connectors.
- Define autoscaling rules based on NLB metrics (connects per second).
- Add synthetic probes to ensure external reachability.
What to measure: Connection success rate, cold start latency, invocation failures.
Tools to use and why: Cloud NLB, platform connector logs, synthetic monitors.
Common pitfalls: Cold starts causing an initial connection backlog; connector misconfig causing SNI mismatch.
Validation: Spike-test with burst traffic and measure invocation latency.
Outcome: Stable front door for serverless TCP workloads with capacity to absorb bursts.
Scenario #3 — Incident response for failing backend health checks (Incident-response/postmortem scenario)
Context: An e-commerce platform sees traffic drop as the NLB marks backends unhealthy after a deploy.
Goal: Quickly restore traffic and identify the root cause.
Why Network load balancer matters here: NLB health checks directly determine whether backends receive traffic.
Architecture / workflow: Client -> NLB -> backend pool.
Step-by-step implementation:
- Check NLB control plane health and probe failure counts.
- Validate recent deployment logs and config changes to health probe endpoints.
- Temporarily redirect traffic to fallback pool or enable a prior stable configuration.
- Reconfigure health checks and roll back the unhealthy deployment.
What to measure: Health check failure rate, deployment timestamp correlation, synthetic probe failures.
Tools to use and why: Provider metrics, deployment CI logs, monitoring dashboards.
Common pitfalls: Health checks using ephemeral ports; deployment removed the listening process before readiness.
Validation: Restore traffic and run the synthetic check suite.
Outcome: Rapid rollback and improved health check gating in the pipeline; postmortem updates to runbooks.
Scenario #4 — Cost vs performance optimization for high throughput NLB (Cost/performance trade-off scenario)
Context: A SaaS service pays for premium NLB instances but seeks cost reduction.
Goal: Reduce cost while maintaining p95 latency and connection success.
Why Network load balancer matters here: Balancer instance type and feature choices affect both cost and performance.
Architecture / workflow: Client -> NLB (premium) -> backend pool.
Step-by-step implementation:
- Analyze telemetry: PPS, throughput, p95 latency, and connection drops.
- Evaluate options: downgrade instance class, use connection multiplexing, or add caching upstream.
- Run staged experiments: reduce NLB spec in non-prod and measure metrics.
- Consider architectural changes: move some traffic to L7 where caching helps.
What to measure: p95 connect latency, error rate, cost per million connections.
Tools to use and why: Billing data correlated with Prometheus metrics and synthetic tests.
Common pitfalls: Underestimating the impact of reduced PPS capacity or missing small-packet workloads.
Validation: Canary traffic at reduced capacity while monitoring SLIs.
Outcome: Potential cost savings with minimal customer impact through optimized routing and workload changes.
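The cost-per-million-connections metric mentioned above is simple arithmetic, but computing it the same way before and after each staged experiment keeps comparisons honest. A sketch (the dollar and connection figures are illustrative, not from any provider's price list):

```python
# Unit-cost metric for comparing NLB instance classes across experiments.
def cost_per_million_connections(monthly_cost_usd, monthly_connections):
    """Dollars per million connections served in a month."""
    if monthly_connections <= 0:
        raise ValueError("monthly_connections must be positive")
    return monthly_cost_usd / (monthly_connections / 1_000_000)

# Example: a $1,200/month balancer handling 4 billion connections
# costs $0.30 per million connections.
```

Track the same ratio for PPS and bytes as well, since small-packet workloads can make a connection-based metric look misleadingly cheap.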
Common Mistakes, Anti-patterns, and Troubleshooting
Each entry below follows the pattern symptom -> root cause -> fix; observability pitfalls are included and recapped at the end.
- Symptom: All backends marked unhealthy after deploy -> Root cause: Health check endpoint changed -> Fix: Add deploy readiness probe and configure health checks to use stable port.
- Symptom: New connections time out -> Root cause: Connection table full -> Fix: Increase table size or scale NLB, implement client backoff.
- Symptom: Client IP not visible at backend -> Root cause: SNAT enabled or externalTrafficPolicy Cluster -> Fix: Use externalTrafficPolicy Local or enable source IP preserve.
- Symptom: High packet retransmits -> Root cause: Network path issues between NLB and backend -> Fix: Verify network ACLs and route asymmetry.
- Symptom: TLS handshake fails intermittently -> Root cause: Mismatched TLS configs when passthrough expected -> Fix: Align TLS versions and SNI settings.
- Symptom: Unexpected latency spikes -> Root cause: Backends overloaded or cross-zone egress -> Fix: Scale backends and enable cross-zone load balancing.
- Symptom: Thundering herd during scale-out -> Root cause: Slow warm-up of backend leading to connection retries -> Fix: Implement slow start and grace periods.
- Symptom: Large billing spikes -> Root cause: Unbounded traffic due to misconfigured routes or DDoS -> Fix: Implement rate limiting and alerts on PPS anomalies.
- Symptom: Inconsistent routing across regions -> Root cause: Anycast misconfiguration or DNS TTL issues -> Fix: Validate anycast announcements and DNS settings.
- Symptom: Observability gaps for L4 metrics -> Root cause: Relying only on L7 traces -> Fix: Add NLB metrics and synthetic L4 probes.
- Symptom: False healthy targets -> Root cause: Health checks only verifying TCP accept not app readiness -> Fix: Use enriched health checks or sidecar probes.
- Symptom: Backend logs show client IP 127.0.0.1 -> Root cause: Localhost forwarding or SNAT -> Fix: Review proxy mode and ensure source IP preservation.
- Symptom: Debugging long-lived sessions is hard -> Root cause: No session lifecycle traces -> Fix: Add tracing for connection lifecycle and sampling.
- Symptom: Alerts fire too often -> Root cause: Too tight thresholds and no dedupe -> Fix: Adjust thresholds, add grouping and suppression.
- Symptom: Failed failover during outage -> Root cause: Manual failover steps not automated -> Fix: Automate failover runbooks and test with game days.
- Symptom: High backend CPU from TLS -> Root cause: TLS terminated at backend -> Fix: Offload TLS at NLB if supported or use hardware acceleration.
- Symptom: Cross-zone bandwidth charges high -> Root cause: Cross-zone routing not optimized -> Fix: Review cross-zone balancing policies.
- Symptom: Metrics cardinality explosion -> Root cause: Tagging many ephemeral targets without aggregation -> Fix: Aggregate labels and use recording rules.
- Symptom: Silent degradation noticed only by users -> Root cause: No synthetic tests across geos -> Fix: Deploy global synthetic monitors.
- Symptom: NLB config drift -> Root cause: Manual changes in production -> Fix: Enforce IaC and policy as code.
- Symptom: Connection resets at scale -> Root cause: Backend ephemeral port exhaustion -> Fix: Increase ephemeral port range or use NAT pooling.
- Symptom: Observability blindspot for UDP -> Root cause: Tooling focused on TCP/HTTP -> Fix: Extend probes and metrics to UDP flows.
- Symptom: Long deployment windows due to draining -> Root cause: No graceful shutdown support in app -> Fix: Implement signal handling and drain endpoints.
- Symptom: Difficulty correlating NLB events with app errors -> Root cause: No distributed tracing for connection flows -> Fix: Tag requests and add correlation IDs where possible.
- Symptom: Alert fatigue during scheduled maintenance -> Root cause: No suppression windows -> Fix: Implement maintenance mode and suppression policies.
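Several of the fixes above (connection table full, thundering herd during scale-out) come down to clients retrying politely. A minimal full-jitter exponential backoff sketch; the base and cap values are illustrative defaults, and `rng` is injectable only so the behavior is testable:

```python
# Full-jitter exponential backoff: each delay is uniform in
# [0, min(cap, base * 2**n)], spreading retries so a recovering
# NLB or backend pool is not hit by a thundering herd.
import random

def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Return the sequence of sleep durations for `attempts` retries."""
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]
```

Clients would sleep for each delay between connection attempts; the cap bounds worst-case recovery latency while the jitter desynchronizes the retry wave.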
Observability pitfalls (recapped from the list above)
- Relying only on L7 traces.
- Missing synthetic L4 checks.
- High cardinality labels causing cost.
- Lack of connection lifecycle tracing.
- No correlation between deployment events and NLB metrics.
Best Practices & Operating Model
Ownership and on-call
- Assign clear ownership for NLB resources at team level.
- Cross-functional on-call that includes network and service owners.
- Define escalation paths for global NLB incidents.
Runbooks vs playbooks
- Runbook: Step-by-step operations for routine tasks and incidents.
- Playbook: Scenario-driven decision frameworks for complex incidents with multiple services.
Safe deployments (canary/rollback)
- Use canary targets and staged traffic shift.
- Implement connection draining and graceful shutdown signals.
- Automate rollback on SLI regressions.
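Connection draining and graceful shutdown hinge on one ordering detail: fail the health check first so the NLB stops routing new connections, then wait out the drain period. A sketch of that sequencing, assuming a threaded server; the class name and drain period are illustrative:

```python
# Graceful-shutdown sketch: SIGTERM flips the health check to unhealthy
# before the process exits, giving the NLB time to drain connections.
import signal
import threading

class DrainController:
    def __init__(self, drain_seconds=30):
        self.drain_seconds = drain_seconds
        self._draining = threading.Event()

    def install(self):
        """Register the SIGTERM handler (call from the main thread)."""
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self._draining.set()  # health endpoint now reports unhealthy

    def healthy(self):
        """Wire this into the health-check handler the NLB probes."""
        return not self._draining.is_set()

    def wait_for_drain(self, timeout=None):
        """Block until draining starts; a real server would then wait for
        in-flight connections to finish, bounded by drain_seconds."""
        return self._draining.wait(timeout)
```

The drain window should be at least as long as the NLB's health-check failure threshold times its interval, or new connections will still arrive after SIGTERM.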
Toil reduction and automation
- Automate capacity scaling based on NLB metrics.
- Use IaC to remove manual changes.
- Automate remediation for common health-check failures.
Security basics
- Restrict control plane access to admins.
- Use least privilege for target registration.
- Enable DDoS protection and rate limiting where available.
- Monitor for anomalous PPS and byte patterns.
Weekly/monthly routines
- Weekly: Review NLB alerts and recent configuration changes.
- Monthly: Validate capacity planning and run a small scale failover drill.
- Quarterly: Chaos exercises and audit of health-check coverage and topology.
Postmortem review items related to NLB
- Correlate NLB metrics with incident timeline.
- Check if health checks and timeouts were appropriate.
- Validate automation and runbook effectiveness.
- Update SLOs and cadence based on incident learnings.
Tooling & Integration Map for Network load balancer
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Cloud NLB service | Managed Layer 4 load balancing | Kubernetes, Autoscaling, DNS | Use for public TCP UDP endpoints |
| I2 | NLB controller | Integrates NLB with Kubernetes | Service API and cloud APIs | Manages target registration |
| I3 | Metrics backend | Stores NLB metrics and history | Prometheus and Grafana | Essential for SLIs |
| I4 | Synthetic monitoring | External L4 probes and checks | Alerting and dashboards | Provides user perspective |
| I5 | eBPF probes | Kernel packet tracing and diagnostics | Host observability stacks | Deep packet-level visibility |
| I6 | DDoS protection | Traffic scrubbing and policing | NLB and edge routers | Protects from volumetric attacks |
| I7 | Deploy pipeline | Automates NLB config changes | IaC and CI tools | Prevents config drift |
| I8 | Incident management | Alerts and on-call routing | Pager and ticketing | Critical during outages |
| I9 | Network appliances | Hardware acceleration and offload | On-prem and cloud interconnect | High performance but costlier |
| I10 | Firewall/NACL | Access control and security policies | VPC and subnet configs | Must allow health checks |
Frequently Asked Questions (FAQs)
What protocols does a Network load balancer support?
Most support TCP and UDP; QUIC support varies by provider. Check provider features; some support TLS passthrough.
Can NLB terminate TLS?
Some NLBs support TLS termination; others only pass TLS through. Support varies by provider.
Does an NLB preserve source IP?
Many NLBs can preserve source IPs depending on configuration and whether DNAT or proxy mode is used.
How does health checking work at L4?
Usually TCP or UDP probes attempting connect or sending probe packets; they do not verify application logic.
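The gap between "accepts a connection" and "can actually serve" is worth seeing in code. A sketch contrasting the two, where the `READY` banner is a made-up protocol detail standing in for whatever application-level readiness signal your service exposes:

```python
# L4 vs enriched health checks: the first passes as soon as the TCP
# handshake completes; the second also requires an application banner.
import socket

def l4_probe(host, port, timeout=2.0):
    """Plain L4 check: true even if the app behind the socket cannot serve."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def enriched_probe(host, port, expect=b"READY", timeout=2.0):
    """Enriched check: also require a readiness banner from the app."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            data = b""
            while len(data) < len(expect):
                chunk = s.recv(len(expect) - len(data))
                if not chunk:
                    break
                data += chunk
            return data == expect
    except OSError:
        return False
```

When the NLB itself only offers TCP probes, run the enriched check from a sidecar or synthetic monitor and surface it as a separate SLI.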
How to debug asymmetric routing with NLB?
Use packet captures and eBPF traces to verify forward and return paths and ensure backend routes through NLB or SNAT.
What are common limits to watch?
Connection table size, PPS capacity, and backend registration limits; specifics vary by provider.
Should I use NLB for HTTP APIs?
Only if you need L4 performance or TLS passthrough; otherwise L7 load balancers provide richer features.
How to handle long-lived connections?
Tune idle timeouts, enable connection draining, and ensure connection table headroom.
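On the client or backend side, tuning idle behavior often means enabling TCP keepalives so mostly-idle sessions are refreshed before the NLB's idle timeout or NAT state expiry drops them. A sketch, noting that the `TCP_KEEP*` option names are Linux-specific and the timing values are illustrative:

```python
# Enable TCP keepalives on a socket carrying a long-lived connection.
import socket
import sys

def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Probe after `idle` seconds of silence, every `interval` seconds,
    declaring the peer dead after `count` failed probes (Linux options)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if sys.platform.startswith("linux"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

Pick `idle + interval * count` comfortably below the NLB's idle timeout so the keepalive traffic itself keeps the flow entry alive.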
How to secure NLB endpoints?
Restrict management access, use security groups, enable DDoS protection, and implement rate limiting.
How to test NLB scalability?
Run synthetic load tests that simulate realistic session durations and packet sizes; include cold starts.
What telemetry is essential?
Connection success rate, health checks, connection table utilization, PPS, and throughput.
How to avoid noisy alerts?
Group by service, set reasonable thresholds, use dedupe and suppression windows.
Can NLB be used in multi-region active-active?
Yes, via anycast, global load balancing, or DNS-based steering combined with regional NLBs.
How to preserve client IP in Kubernetes?
Use Service externalTrafficPolicy Local or NLB modes that preserve source IP.
Do NLBs support rate limiting?
Some do; otherwise implement rate limiting at upstream proxies or firewall rules.
What causes connection table exhaustion?
Too many concurrent connections, especially many small packets from IoT or bots.
How to correlate NLB events with app logs?
Add correlation IDs and trace connection lifecycle across NLB and backend logs.
Are NLB metrics reliable for billing spikes?
They are indicative, but correlate them with billing data; some metrics may be sampled.
Conclusion
Network load balancers remain a critical building block for low-latency, high-throughput, non-HTTP services in modern cloud-native architectures. Proper instrumentation, SLO-driven operations, and automated runbooks reduce incidents and support rapid team velocity. Design choices around health checks, connection lifecycle, and capacity directly affect customer experience.
Next 7 days plan (practical actions)
- Day 1: Inventory all services using NLB and map owners.
- Day 2: Ensure Prometheus or cloud metrics collection for NLBs is enabled.
- Day 3: Create or update runbooks for top three NLB failure modes.
- Day 4: Add synthetic L4 probes for critical external endpoints.
- Day 5: Run a small-scale load test verifying connection table headroom.
- Day 6: Review SLOs and update alerting to reduce noise and add burn-rate checks.
- Day 7: Schedule a chaos test to simulate backend failures and practice runbook steps.
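The Day 5 headroom check can start as a small concurrent-connection script before reaching for a full load-testing tool. A sketch, intended for a staging NLB endpoint (host, port, and the worker count are placeholders to scale up gradually):

```python
# Open `count` TCP connections in parallel and report how many succeeded;
# compare the result against your connection-table capacity estimate.
import socket
from concurrent.futures import ThreadPoolExecutor

def open_concurrent(host, port, count, timeout=2.0):
    """Return the number of connections successfully opened at once."""
    def attempt(_):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            return None

    with ThreadPoolExecutor(max_workers=32) as pool:
        socks = [s for s in pool.map(attempt, range(count)) if s is not None]
    for s in socks:
        s.close()
    return len(socks)
```

Hold the connections open for realistic session durations in a real test; an instantaneous open-and-close understates table pressure from long-lived clients.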
Appendix — Network load balancer Keyword Cluster (SEO)
Primary keywords
- network load balancer
- NLB
- layer 4 load balancer
- TCP load balancer
- UDP load balancer
- low latency load balancer
- high throughput load balancer
- managed NLB
- cloud NLB
Secondary keywords
- connection tracking
- virtual IP for NLB
- NLB health checks
- preserve source IP
- connection draining
- NLB autoscaling
- NLB best practices
- NLB metrics
- NLB troubleshooting
- NLB architecture
Long-tail questions
- what is a network load balancer used for
- how does a network load balancer work
- network load balancer vs application load balancer
- how to preserve client ip with nlb
- nlb health check not passing
- nlb connection table full
- how to measure nlb performance
- nlb best practices for kubernetes
- nlb configuration for udp
- how to debug nlb asymmetric routing
- can nlb terminate tls
- how to scale nlb for gaming servers
- nlb observability and metrics
- nlb latency and p95 targets
- nlb failover strategies for multi region
- how to run chaos tests on nlb
- nlb cost optimization tips
- nlb and anycast architecture
- nlb for serverless tcp endpoints
- nlb synthetic monitoring setup
Related terminology
- VIP
- target group
- DNAT
- SNAT
- PPS packets per second
- throughput bps
- health probe
- externalTrafficPolicy
- kube service loadbalancer
- eBPF tracing
- packet capture
- connection table
- idle timeout
- session affinity
- flow hashing
- TLS passthrough
- TLS termination
- DDoS protection
- synthetic monitoring
- autoscale policy
- cross-zone balancing
- anycast routing
- BGP failover
- kernel bypass
- hardware offload
- connection draining
- observability sampling
- burn rate
- SLI SLO
- incident runbook
- chaos engineering
- load testing
- cold start
- throttle and rate limit
- origin server
- edge network
- NAT gateway
- reverse proxy
- application load balancer