Enterprise & high availability

Run many replicas, safely

UDB is a broker: backend databases own data replication, while UDB owns broker HA. That covers leader election for singleton work, crash recovery for sagas and 2PC, backpressure, observability, and operational runbooks. Multiple replicas run behind a load balancer without duplicating singleton work.

HA control plane

Leader election & singleton coordination

CDC tailers, saga/2PC recovery, and reapers are singleton workers. UDB coordinates them with a durable Postgres row lease (heartbeat, TTL, fencing token, peer-skip), so two replicas never run the same worker at once, and losing the lease owner triggers a bounded failover to another replica.
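The lease semantics described above (heartbeat, TTL, fencing token, peer-skip) can be sketched roughly as follows. This is an illustrative in-memory model, not UDB's implementation: `Lease`, `LeaseStore`, `try_acquire`, and `is_current` are hypothetical names, and a real deployment would presumably express the same check as one conditional update against the Postgres lease row.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: Optional[str] = None
    expires_at: float = 0.0
    fencing_token: int = 0


class LeaseStore:
    """In-memory stand-in for a single durable lease row (assumption)."""

    def __init__(self) -> None:
        self.row = Lease()

    def try_acquire(self, replica: str, now: float, ttl: float) -> Optional[int]:
        """Heartbeat if we own a live lease, take over if it expired, else skip."""
        row = self.row
        if row.owner == replica and now < row.expires_at:
            row.expires_at = now + ttl  # heartbeat: extend, keep the same token
            return row.fencing_token
        if row.owner is None or now >= row.expires_at:
            row.owner = replica
            row.expires_at = now + ttl
            row.fencing_token += 1  # new ownership epoch, strictly increasing
            return row.fencing_token
        return None  # a peer holds a live lease: peer-skip

    def is_current(self, token: int) -> bool:
        """Lets downstream writes reject work from a stale owner."""
        return token == self.row.fencing_token
```

In this sketch failover time is bounded by the TTL: a replica that stops heartbeating loses ownership once `expires_at` passes, and any work it still attempts carries an older fencing token that `is_current` rejects.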

👑

Leased singletons

CDC tailer, saga recovery, 2PC in-doubt recovery, storage orphan reaper, and projection workers run under one shared singleton lease with a monotonic fencing token.

🧩

Real 2PC recovery

A durable in-doubt participant ledger with idempotent commit/rollback retry and lease-protected ownership; split-brain races resolve to a single terminal outcome.
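One way to picture "a single terminal outcome" with idempotent retry is a ledger where the first recorded decision wins and each participant is marked once it has been told. The sketch below is a minimal in-memory model under that assumption; `InDoubtLedger`, `recover`, and the `send` callback are hypothetical names, not UDB APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

COMMIT, ROLLBACK = "commit", "rollback"


@dataclass
class InDoubtLedger:
    """In-memory stand-in for a durable in-doubt participant ledger."""
    outcomes: Dict[str, str] = field(default_factory=dict)
    applied: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def decide(self, txn_id: str, proposed: str) -> str:
        """First recorded decision wins; later proposals get the stored one."""
        return self.outcomes.setdefault(txn_id, proposed)


def recover(
    ledger: InDoubtLedger,
    txn_id: str,
    participants: List[str],
    proposed: str,
    send: Callable[[str, str, str], None],
) -> str:
    """Drive every participant to the terminal outcome; safe to call again."""
    outcome = ledger.decide(txn_id, proposed)
    for participant in participants:
        if (txn_id, participant) in ledger.applied:
            continue  # already finished in an earlier attempt
        send(participant, txn_id, outcome)  # may raise; retried on the next call
        ledger.applied[(txn_id, participant)] = outcome
    return outcome
```

If `send` fails partway, a later `recover` call picks up only the remaining participants, and a competing caller that proposes the opposite outcome still receives the decision already stored.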

↩️

Real saga recovery

Durable state transitions, idempotent compensations, and resumable recovery after restart — proven by multi-node peer-skip tests.
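Resumable saga recovery can be illustrated with a small sketch: the saga's progress is stored as data, and recovery undoes completed steps in reverse order while skipping any that were already compensated before the restart. The `state` dictionary layout and the `resume_saga` function are assumptions made for illustration only.

```python
from typing import Callable, Dict


def resume_saga(state: Dict, compensations: Dict[str, Callable[[], None]]) -> Dict:
    """Undo completed steps in reverse order, recording each one as it finishes.

    `state` stands in for a persisted saga row (hypothetical layout):
    {"status": str, "completed": [step names], "compensated": [step names]}
    """
    if state["status"] == "compensated":
        return state  # terminal state: nothing left to do
    state["status"] = "compensating"
    for step in reversed(state["completed"]):
        if step in state["compensated"]:
            continue  # already undone before the restart
        compensations[step]()
        state["compensated"].append(step)  # a real system would persist here
    state["status"] = "compensated"
    return state
```

A crash between running a compensation and recording it would cause that compensation to run again on the next pass, which is why the compensations themselves need to be idempotent.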

Edge scale

Backpressure & per-tenant fairness

✓ Global + per-method concurrency limits via bounded operation channels and a server concurrency limit.
✓ Per-tenant concurrency: scoped semaphores so one tenant can’t starve others or exhaust the backend pool.
✓ Overload semantics: ResourceExhausted with a bounded queue timeout instead of unbounded waiting.
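The per-tenant semaphore and bounded queue timeout can be sketched with asyncio. This is a minimal illustration, assuming one semaphore per tenant and a fixed wait budget; `TenantLimiter` and the local `ResourceExhausted` exception are hypothetical stand-ins for whatever UDB uses internally.

```python
import asyncio
from typing import Awaitable, Callable, Dict, TypeVar

T = TypeVar("T")


class ResourceExhausted(Exception):
    """Stand-in for a gRPC RESOURCE_EXHAUSTED status."""


class TenantLimiter:
    def __init__(self, per_tenant: int, queue_timeout: float) -> None:
        self._per_tenant = per_tenant
        self._queue_timeout = queue_timeout
        self._sems: Dict[str, asyncio.Semaphore] = {}

    def _sem(self, tenant: str) -> asyncio.Semaphore:
        # Created lazily so each tenant gets its own independent limit.
        if tenant not in self._sems:
            self._sems[tenant] = asyncio.Semaphore(self._per_tenant)
        return self._sems[tenant]

    async def run(self, tenant: str, op: Callable[[], Awaitable[T]]) -> T:
        sem = self._sem(tenant)
        try:
            # Wait for a slot, but only up to the queue timeout.
            await asyncio.wait_for(sem.acquire(), timeout=self._queue_timeout)
        except asyncio.TimeoutError:
            raise ResourceExhausted(f"tenant {tenant} is at its concurrency limit")
        try:
            return await op()
        finally:
            sem.release()
```

Because each tenant has its own semaphore, a tenant that saturates its slots fails fast with `ResourceExhausted` while other tenants continue to be admitted.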

✓ Per-tenant rate limits on auth bootstrap, authz checks, storage presign/finalize, WebRTC signaling, and CDC/admin APIs.
✓ Stateless replicas: durable leases own background work; admission uses request metadata + shared stores, not process-local ownership.
✓ Shard-aware routing: tenant primary key, secondary indexes, scatter-gather fallback, and cross-shard cost metrics.
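A per-tenant rate limit is commonly built as a token bucket keyed by tenant and API group. The sketch below assumes that technique; the source does not say which algorithm UDB uses, and `TenantRateLimiter`, `Bucket`, and the key shape are hypothetical. It keeps buckets in a local dictionary for brevity, whereas the stateless-replica design above implies the bucket state would live in a shared store.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Bucket:
    tokens: float
    updated_at: float


class TenantRateLimiter:
    """Token bucket per (tenant, api) pair; local dict used only for illustration."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets: Dict[Tuple[str, str], Bucket] = {}

    def allow(self, tenant: str, api: str, now: float) -> bool:
        key = (tenant, api)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = Bucket(self.burst, now)
        # Refill for the elapsed time, capped at the burst size.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
        bucket.updated_at = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

Keying by both tenant and API group means a burst of storage presign calls from one tenant does not consume that tenant's authz budget, nor any other tenant's.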

Observability

SLOs, traces, and honest health

🔭

OpenTelemetry

W3C trace propagation and spans at the chokepoints: grpc.request → authz.decide → backend.query/mutate → cdc.publish → saga.compensate.
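W3C trace propagation carries context in a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below parses version `00` headers and derives the header a downstream hop would receive; it uses only the standard library, and the helper names `parse_traceparent` and `child_traceparent` are illustrative rather than part of UDB or the OpenTelemetry SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 32 hex trace id, 16 hex parent (span) id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero ids are invalid per the spec
    return SpanContext(trace_id, span_id, flags)


def child_traceparent(parent: SpanContext) -> str:
    """Same trace id, fresh span id: the header passed to the next hop."""
    return f"00-{parent.trace_id}-{secrets.token_hex(8)}-{parent.flags}"
```

Every span along a chain such as grpc.request → authz.decide → backend.query would share the same trace id, which is what lets a tracing backend stitch the hops into one trace.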

📈

SLOs & error budgets

14 lanes with p50/p95/p99 latency and availability targets mapped to real metrics, including authn, authz, storage, asset, WebRTC, CDC, and policy-distribution latency.
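As a worked illustration of how a lane's numbers could be computed, the sketch below derives nearest-rank percentiles from latency samples and the remaining error budget from an availability target. The function names, the nearest-rank method, and the example target are assumptions; the source does not specify how UDB computes its SLOs.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(slo: float, total: int, failed: int) -> float:
    """Fraction of the error budget left; a negative value means the SLO is blown."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total
    return 1.0 - failed / allowed_failures
```

For example, with a hypothetical 99.9% availability target and 1,000,000 requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget.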

🩺

gRPC health per listener

doctor, GetHealthReport, gRPC health, and /readyz all report from the same readiness facts, so advertised capabilities match the routes that are actually mounted.
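The idea of several health surfaces reading one set of facts can be sketched as a single immutable record with thin adapters on top. Everything here is hypothetical: the `ReadinessFacts` fields, the function names, and the report shape are illustrative, with only the `SERVING` / `NOT_SERVING` status names taken from the standard gRPC health checking protocol.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class ReadinessFacts:
    """Single source of truth that every health surface reads (assumed fields)."""
    backend_reachable: bool
    lease_store_reachable: bool
    mounted_routes: FrozenSet[str]

    @property
    def ready(self) -> bool:
        return self.backend_reachable and self.lease_store_reachable


def readyz(facts: ReadinessFacts) -> Tuple[int, str]:
    """HTTP-style readiness probe."""
    return (200, "ok") if facts.ready else (503, "not ready")


def grpc_health(facts: ReadinessFacts) -> str:
    """Status name as used by the gRPC health checking protocol."""
    return "SERVING" if facts.ready else "NOT_SERVING"


def health_report(facts: ReadinessFacts, declared: Iterable[str]) -> Dict:
    """Advertise only capabilities whose routes are actually mounted."""
    capabilities = sorted(c for c in declared if c in facts.mounted_routes)
    return {"ready": facts.ready, "capabilities": capabilities}
```

Because each adapter is a pure function of the same record, the surfaces cannot disagree about readiness, and a capability whose route is not mounted never appears in the report.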

Metrics that matter: method-security denials, tenant mismatch, revocation-lookup failures, CDC lag, DLQ depth, journal failures, outbox enqueue failures, compensation failures, native-service degraded states, and edge load-shedding — all exported for alerting.

Operability

Runbooks for the bad days

Signing-key compromise
Refresh-token replay wave
Authz policy rollback
Tenant-wide emergency revoke
CDC backlog / DLQ recovery
Native-service dependency outage
Object/vector backend partial failure
Leader-election failover
Bad policy rollout rollback

Multi-node, load, conformance, and compliance readiness gates run in CI or a documented staging gate before UDB is called enterprise-ready.