Run many replicas, safely
UDB is a broker: backend databases own data replication, and UDB owns broker HA. That covers leader election for singleton work, saga and 2PC recovery, backpressure, observability, and operational runbooks. Multiple replicas run behind a load balancer without duplicating singleton work.
Leader election & singleton coordination
CDC tailers, saga/2PC recovery, and reapers are singleton workers — UDB coordinates them with a durable Postgres row lease (heartbeat, TTL, fencing token, peer-skip), so two replicas never double-run them and lease-owner loss causes bounded failover to another replica.
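To make the lease mechanics concrete, here is a minimal in-memory sketch of a row lease with heartbeat, TTL, fencing token, and peer-skip. It is not UDB's implementation: the names `LeaseRow`, `LeaseStore`, `try_acquire`, `heartbeat`, and `is_current` are hypothetical, and a real store would presumably perform each method as one atomic conditional UPDATE on the Postgres row.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseRow:
    # Stand-in for the single durable lease row (column names are hypothetical).
    owner: Optional[str] = None
    expires_at: float = 0.0
    fencing_token: int = 0


class LeaseStore:
    # In-memory model only; time is passed in explicitly to keep it deterministic.

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.row = LeaseRow()

    def _live(self, now: float) -> bool:
        return self.row.owner is not None and self.row.expires_at > now

    def try_acquire(self, replica: str, now: float) -> Optional[int]:
        if self._live(now):
            # Peer-skip: a live lease held by another replica means "do nothing".
            return self.row.fencing_token if self.row.owner == replica else None
        # Free or expired: take over and bump the token so stale owners are fenced.
        self.row = LeaseRow(replica, now + self.ttl, self.row.fencing_token + 1)
        return self.row.fencing_token

    def heartbeat(self, replica: str, token: int, now: float) -> bool:
        # Only the current, unexpired owner may extend the TTL.
        if self._live(now) and self.row.owner == replica and self.row.fencing_token == token:
            self.row.expires_at = now + self.ttl
            return True
        return False

    def is_current(self, token: int) -> bool:
        # Downstream writes can reject any token older than the newest one.
        return token == self.row.fencing_token


store = LeaseStore(ttl_seconds=10)
token_a = store.try_acquire("replica-a", now=0)         # first owner
skipped = store.try_acquire("replica-b", now=5)         # peer-skip while A is live
renewed = store.heartbeat("replica-a", token_a, now=8)  # TTL extended to t=18
token_b = store.try_acquire("replica-b", now=20)        # A expired, B takes over
stale = store.heartbeat("replica-a", token_a, now=21)   # A is fenced out
```

Failover time in this model is bounded by the TTL: once the owner stops heartbeating, another replica can acquire the lease as soon as `expires_at` passes.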
Leased singletons
CDC tailer, saga recovery, 2PC in-doubt recovery, storage orphan reaper, and projection workers run under one shared singleton lease with a monotonic fencing token.
Real 2PC recovery
A durable in-doubt participant ledger with idempotent commit/rollback retry and lease-protected ownership; split-brain races resolve to a single terminal outcome.
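One way to picture "a single terminal outcome" is a ledger whose rows accept a decision only while they are still in doubt, followed by an idempotent retry against the participant. The sketch below is illustrative only; `InDoubtLedger`, `Participant`, and `recover` are hypothetical names, and the dict stands in for a durable table.

```python
from enum import Enum
from typing import Dict


class Outcome(Enum):
    IN_DOUBT = "in_doubt"
    COMMITTED = "committed"
    ROLLED_BACK = "rolled_back"


class InDoubtLedger:
    # Stand-in for a durable ledger table keyed by transaction id.
    def __init__(self) -> None:
        self._rows: Dict[str, Outcome] = {}

    def record(self, txn_id: str) -> None:
        self._rows.setdefault(txn_id, Outcome.IN_DOUBT)

    def resolve(self, txn_id: str, proposed: Outcome) -> Outcome:
        # Compare-and-set: only an IN_DOUBT row accepts a decision.
        if proposed is Outcome.IN_DOUBT:
            raise ValueError("proposed outcome must be terminal")
        if self._rows[txn_id] is Outcome.IN_DOUBT:
            self._rows[txn_id] = proposed
        return self._rows[txn_id]  # the outcome that actually stands


class Participant:
    # Hypothetical backend whose apply() may fail transiently.
    def __init__(self, failures_before_success: int = 0) -> None:
        self.state: Dict[str, Outcome] = {}
        self.calls = 0
        self._failures_left = failures_before_success

    def apply(self, txn_id: str, outcome: Outcome) -> None:
        self.calls += 1
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionError("participant unreachable")
        self.state[txn_id] = outcome  # same write every time, so retries are safe


def recover(ledger: InDoubtLedger, participant: Participant, txn_id: str,
            proposed: Outcome, max_attempts: int = 3) -> Outcome:
    winner = ledger.resolve(txn_id, proposed)  # decide once, durably
    for _ in range(max_attempts):
        try:
            participant.apply(txn_id, winner)  # then push the winning outcome
            return winner
        except ConnectionError:
            continue
    raise RuntimeError(f"{txn_id} not applied after {max_attempts} attempts")


ledger = InDoubtLedger()
ledger.record("txn-1")
participant = Participant(failures_before_success=1)
first = recover(ledger, participant, "txn-1", Outcome.COMMITTED)
second = recover(ledger, participant, "txn-1", Outcome.ROLLED_BACK)  # loses the race
```

Both recoverers end up applying the same outcome, because the second one adopts the decision already stored in the ledger instead of its own proposal.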
Real saga recovery
Durable state transitions, idempotent compensations, and resumable recovery after restart — proven by multi-node peer-skip tests.
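Resumable recovery can be sketched as a journal that records which compensations have finished, so a restarted replica skips them and continues with the rest. This is a simplified model, not UDB's code: `SagaJournal`, `recover`, and the step names are hypothetical, and the journal lives in memory here where a real one would be durable.

```python
from typing import Callable, Dict, List


class SagaJournal:
    # Stand-in for a durable saga journal.
    def __init__(self, completed: List[str]) -> None:
        self.state = "failed"                # forward execution already failed
        self.completed = list(completed)     # steps whose forward action finished
        self.compensated: List[str] = []     # steps whose compensation finished


def recover(journal: SagaJournal, compensations: Dict[str, Callable[[], None]]) -> None:
    if journal.state == "compensated":
        return  # terminal state: nothing left to do
    journal.state = "compensating"
    for step in reversed(journal.completed):  # undo in reverse order
        if step in journal.compensated:
            continue                          # already undone before the restart
        compensations[step]()
        journal.compensated.append(step)
    journal.state = "compensated"


calls: List[str] = []
crash_once = {"armed": True}


def refund_charge() -> None:
    # Simulates a replica crash the first time this compensation runs.
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("replica crashed mid-compensation")
    calls.append("charge")


compensations = {
    "reserve": lambda: calls.append("reserve"),
    "charge": refund_charge,
    "ship": lambda: calls.append("ship"),
}

journal = SagaJournal(completed=["reserve", "charge", "ship"])
try:
    recover(journal, compensations)   # undoes "ship", then crashes on "charge"
except RuntimeError:
    pass
state_after_crash = journal.state
recover(journal, compensations)       # resumes: skips "ship", finishes the rest
recover(journal, compensations)       # no-op on a terminal journal
```

A crash between running a compensation and journaling it would cause that compensation to run again on resume, which is why the compensations themselves need to be idempotent.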
Backpressure & per-tenant fairness
- ✓Global + per-method concurrency limits via bounded operation channels and a server concurrency limit.
- ✓Per-tenant concurrency — scoped semaphores so one tenant can’t starve others or exhaust the backend pool.
- ✓Overload semantics: ResourceExhausted with a bounded queue timeout instead of unbounded waiting.
- ✓Per-tenant rate limits on auth bootstrap, authz checks, storage presign/finalize, WebRTC signaling, and CDC/admin APIs.
- ✓Stateless replicas — durable leases own background work; admission uses request metadata + shared stores, not process-local ownership.
- ✓Shard-aware routing — tenant primary key, secondary indexes, scatter-gather fallback, and cross-shard cost metrics.
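The concurrency and overload items above can be sketched with per-tenant semaphores in front of a global one, each with a bounded wait. This is an assumption-laden illustration using asyncio, not UDB's admission path: `TenantLimiter`, its parameters, and the local `ResourceExhausted` exception (a stand-in for the gRPC status) are hypothetical.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, List


class ResourceExhausted(Exception):
    # Stand-in for the gRPC RESOURCE_EXHAUSTED status.
    pass


class TenantLimiter:
    def __init__(self, global_limit: int, per_tenant_limit: int, queue_timeout: float) -> None:
        self._global = asyncio.Semaphore(global_limit)
        self._per_tenant = defaultdict(lambda: asyncio.Semaphore(per_tenant_limit))
        self._timeout = queue_timeout

    async def _acquire(self, sem: asyncio.Semaphore, what: str) -> None:
        # Bounded wait: fail fast instead of queueing forever.
        try:
            await asyncio.wait_for(sem.acquire(), self._timeout)
        except asyncio.TimeoutError:
            raise ResourceExhausted(f"{what} queue timeout") from None

    async def run(self, tenant: str, work: Callable[[], Awaitable[str]]) -> str:
        tenant_sem = self._per_tenant[tenant]
        # Tenant slot first, so a noisy tenant queues on its own semaphore
        # and does not sit on global slots while waiting.
        await self._acquire(tenant_sem, f"tenant {tenant}")
        try:
            await self._acquire(self._global, "global")
            try:
                return await work()
            finally:
                self._global.release()
        finally:
            tenant_sem.release()


async def demo() -> List[object]:
    # Built inside the running loop so the semaphores bind to it.
    limiter = TenantLimiter(global_limit=4, per_tenant_limit=1, queue_timeout=0.05)

    async def slow() -> str:
        await asyncio.sleep(0.2)
        return "ok"

    async def fast() -> str:
        return "ok"

    return await asyncio.gather(
        limiter.run("noisy", slow),
        limiter.run("noisy", slow),   # waits behind the first call, then times out
        limiter.run("quiet", fast),   # unaffected by the noisy tenant
        return_exceptions=True,
    )


results = asyncio.run(demo())
```

In this model the second "noisy" call is rejected after the queue timeout while the "quiet" tenant is served immediately, which is the fairness property the list describes.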
SLOs, traces, and honest health
OpenTelemetry
W3C trace propagation and spans at the chokepoints: grpc.request → authz.decide → backend.query/mutate → cdc.publish → saga.compensate.
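For readers unfamiliar with W3C trace propagation, the sketch below shows the core idea using only the standard library: each hop keeps the incoming trace id from the `traceparent` header and mints a new span id. It handles version `00` headers only, and `parse_traceparent` and `child_traceparent` are hypothetical helper names; a real deployment would normally rely on an OpenTelemetry SDK propagator instead.

```python
import re
import secrets
from typing import Optional, Tuple

# traceparent: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid
    return trace_id, parent_id, flags


def child_traceparent(incoming: Optional[str]) -> str:
    # Continue the caller's trace if the header is valid, else start a new one.
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is not None:
        trace_id, _, flags = parsed
    else:
        trace_id, flags = secrets.token_hex(16), "01"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)   # header a grpc.request span would pass on
fresh = child_traceparent("not-a-valid-header")
```

Because every hop reuses the trace id, spans such as grpc.request and backend.query end up in one trace even when they run on different replicas.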
SLOs & error budgets
14 lanes with p50/p95/p99 + availability mapped to real metrics — authn, authz, storage, asset, WebRTC, CDC, and policy-distribution latency.
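As a worked illustration of what a lane tracks, the sketch below computes nearest-rank percentiles and the remaining error budget for an availability target. The 99.9% target, request counts, and latency samples are made-up numbers, and `percentile` and `error_budget_remaining` are hypothetical helpers, not UDB metrics APIs.

```python
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    # Nearest-rank percentile; pct is an integer from 1 to 100.
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)   # ceil(pct * n / 100) in integer math
    return ordered[max(rank, 1) - 1]


def error_budget_remaining(target: float, total: int, failed: int) -> float:
    # Fraction of the allowed failures that is still unspent (negative if overspent).
    allowed = (1.0 - target) * total
    if allowed <= 0:
        raise ValueError("target leaves no error budget")
    return 1.0 - failed / allowed


latencies_ms = [float(ms) for ms in range(1, 101)]   # hypothetical samples
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Hypothetical lane: 99.9% availability target, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

With these numbers the lane may fail about 1,000 requests in the window, so 250 failures leave roughly three quarters of the budget.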
gRPC health per listener
doctor, GetHealthReport, gRPC health, and /readyz all fold the same readiness facts — capabilities match actual mounted routes.
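"Folding the same readiness facts" can be pictured as one facts object that every health surface derives its answer from, so the surfaces cannot disagree. The sketch below is a guess at the shape rather than UDB's code: `ReadinessFacts`, its fields, and the route names are hypothetical, while SERVING and NOT_SERVING are the standard gRPC health statuses.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class ReadinessFacts:
    # Single source of truth; field names are illustrative assumptions.
    backend_reachable: bool
    lease_store_reachable: bool
    mounted_routes: FrozenSet[str]


def is_ready(facts: ReadinessFacts) -> bool:
    return facts.backend_reachable and facts.lease_store_reachable


def readyz(facts: ReadinessFacts) -> Tuple[int, str]:
    # HTTP view of the facts.
    return (200, "ok") if is_ready(facts) else (503, "not ready")


def grpc_health(facts: ReadinessFacts) -> str:
    # gRPC health view of the same facts.
    return "SERVING" if is_ready(facts) else "NOT_SERVING"


def capabilities(facts: ReadinessFacts) -> List[str]:
    # Advertise only what is actually mounted.
    return sorted(facts.mounted_routes)


healthy = ReadinessFacts(True, True, frozenset({"storage", "authz"}))
degraded = ReadinessFacts(True, False, frozenset({"authz"}))
```

Because `readyz` and `grpc_health` both call `is_ready`, a load balancer probing HTTP and a client probing gRPC always see the same verdict.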
Metrics that matter: method-security denials, tenant mismatch, revocation-lookup failures, CDC lag, DLQ depth, journal failures, outbox enqueue failures, compensation failures, native-service degraded states, and edge load-shedding — all exported for alerting.
Runbooks for the bad days
Multi-node, load, conformance, and compliance readiness gates run in CI or a documented staging gate before UDB is called enterprise-ready.