Runbook: integration-gateway down

Service affected: services/integration-gateway (Go gRPC + HTTP webhook listener)
Owner: integration team
Severity: HIGH. Disbursement attempts from the API will fail. Existing in-flight disbursements normally settle via the bank's async webhook, but that webhook terminates at this same service, so a long outage stalls every pending settlement.

Symptom

Any one of:

/readyz on the API reports ledger: up, gateway: down (or skipped if GATEWAY_GRPC_ADDR unset).
Alert demozpay_dependency_up{dependency="integration-gateway"} == 0 for > 1 minute.
A spike in demozpay_ledger_grpc_duration_seconds{rpc="PostTransaction"} failures with the error message including connection refused or name resolution failed.
User-reported: POST /api/ewa/requests/:id/disburse returns 502 / 504 / Internal.
kubectl get pods -n demoz shows gateway-* in CrashLoopBackOff, OOMKilled, or Terminated (1).

Likely causes

1. The gateway crashed at boot. This is almost always a config error: missing GATEWAY_DASHEN_SIGNING_KEY, wrong GATEWAY_DATABASE_URL, or partner-adapter init failure (e.g. invalid GATEWAY_DASHEN_BASE_URL).
2. pg-gateway is unreachable. pgx pool init fails fast with a clear error; the gateway exits before serving.
3. Out of memory. Extremely unlikely for the gateway (its working set is tiny), but it can happen if a runaway webhook payload was POSTed. The OOM killer leaves a Kubernetes event.
4. Image pull failure during a roll-out (ImagePullBackOff).
5. Network policy / service mesh blocking the gateway's ports, typically after a Calico / Istio config change.

Diagnosis steps

Capture the boot logs. kubectl logs -n demoz deploy/gateway --tail=200 (add --previous if the container has already restarted). The Go service emits JSON logs; the last line before exit identifies causes 1 and 2, and the absence of any log points to cause 3. Examples:
- "config load failed", "err": "GATEWAY_DASHEN_SIGNING_KEY is required" → cause 1.
- "pg pool init failed", "err": "..." → cause 2.
- No log at all, exit code 137 + Kubernetes event Reason: OOMKilled → cause 3.
Pod state. kubectl describe pod -n demoz <pod> shows recent events; ImagePullBackOff jumps out for cause 4. Check the image tag against the release manifest.
Network reachability. From an API pod: nc -zv gateway 50052. If connection refused from inside the cluster but the pod is Running, the gateway hasn't bound the port yet — usually still booting. If no route to host, that's a network policy issue (cause 5).
pg-gateway state. kubectl get pod -n demoz -l app=pg-gateway → if not Running + Healthy, the gateway can't come up. Diagnose Postgres first.
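
The mapping from evidence to cause in the steps above can be written down as a small triage helper. This is a minimal sketch, not part of the gateway codebase: the function name and its arguments are hypothetical, and it only matches the log strings, exit code, and event reasons quoted in this runbook.

```python
from typing import Iterable, Optional


def classify_gateway_failure(
    last_log_line: Optional[str],
    exit_code: Optional[int],
    event_reasons: Iterable[str] = (),
) -> Optional[int]:
    """Return the runbook cause number (1-4), or None if nothing matches.

    Hypothetical helper: the inputs are whatever the on-call engineer copied
    from `kubectl logs` and `kubectl describe pod`.
    """
    reasons = set(event_reasons)

    # Cause 4: the image never started, so there are no logs to inspect.
    if "ImagePullBackOff" in reasons:
        return 4

    # Cause 3: exit code 137 together with an OOMKilled event.
    if exit_code == 137 and "OOMKilled" in reasons:
        return 3

    if not last_log_line:
        return None

    # Causes 1 and 2: match the messages quoted in the diagnosis steps.
    if "config load failed" in last_log_line:
        return 1
    if "pg pool init failed" in last_log_line:
        return 2

    return None
```

Cause 5 is deliberately absent: a network policy problem leaves the pod Running with clean logs, so it has to be confirmed with the reachability check rather than from pod evidence. A return value of None means the evidence does not match any known pattern and the escalation timer applies.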

Mitigation

Config error (cause 1): roll back to the last-good ConfigMap / Secret and redeploy to restore service. Then add the missing variable and roll forward.
pg-gateway unreachable (cause 2): bring pg-gateway back. If the DB itself is healthy, this is almost always a network or auth (password) issue. Check the Kubernetes Secret holding GATEWAY_DATABASE_URL.
OOM (cause 3): raise the pod's memory limit to 1.5× its current value, redeploy, and follow up to find the runaway payload. The gateway should never see a single request that large, so the upstream API is doing something wrong.
Image pull (cause 4): confirm the image tag exists in the registry; if a pull-secret expired, refresh it.
Network policy (cause 5): temporarily widen the policy to allow api → gateway:50052 and <partner-public-CIDR> → gateway:50053; pin the change with the network owner before closing the ticket.
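
To confirm that a policy change actually restored api → gateway:50052, the nc check from the diagnosis steps can be repeated from a script. The sketch below is an assumption-laden stand-in for nc -zv: probe_tcp and its result labels are hypothetical names, and it uses only the Python standard library.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connect attempt.

    Returns one of: "open", "refused", "dns-failure", "timeout",
    "no-route", "error". The labels are illustrative, not a standard.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, nothing listening: the gateway has not bound the port.
        return "refused"
    except socket.gaierror:
        # The service name did not resolve.
        return "dns-failure"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no-route"
        return "error"
```

Run from an API pod, probe_tcp("gateway", 50052) returning "refused" matches the still-booting case, while "no-route" matches cause 5. A "timeout" is worth treating as a probable network issue too, since some policy engines drop packets silently instead of rejecting them.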

While the gateway is down

In-flight disbursements (rows with disbursement.status = INITIATED / SUBMITTED / ACCEPTED) are NOT lost. The partner has them. When the gateway returns, its webhook listener will receive the partner callbacks and settle them.
API-side EWA / loan rows stuck in SUBMITTED_TO_BANK are equally safe. The settlement-poller will re-poll once the gateway returns.
New disbursement attempts during the outage will fail fast at the API: the DisburseEwaUseCase / DisburseLoanUseCase catches the gRPC error and leaves the row in APPROVED. The user-facing endpoint returns 500. Do not retry server-side automatically; let the user retry once the alert clears.
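
The fail-fast behaviour described above can be illustrated with a language-neutral sketch. The real use cases live in the TypeScript API; everything below (GatewayUnavailable, DisbursementRequest, disburse, and the 200 success code) is a hypothetical stand-in chosen to show the state handling, not a copy of that code.

```python
from dataclasses import dataclass
from typing import Callable


class GatewayUnavailable(Exception):
    """Stand-in for the gRPC error raised when the gateway cannot be reached."""


@dataclass
class DisbursementRequest:
    id: str
    status: str = "APPROVED"


def disburse(
    request: DisbursementRequest,
    submit_to_gateway: Callable[[str], None],
) -> int:
    """Attempt exactly one submission and return an HTTP status code."""
    try:
        submit_to_gateway(request.id)
    except GatewayUnavailable:
        # Fail fast: no retry loop, and the row stays APPROVED so the
        # user can retry once the gateway is back.
        return 500
    # Assumed success path: the row moves on to SUBMITTED_TO_BANK.
    request.status = "SUBMITTED_TO_BANK"
    return 200
```

The point of the design is that a failed attempt leaves no partial state behind: the row is indistinguishable from one that was never submitted, so a later user-initiated retry is safe.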

Resolution

Cause 1 / 4: add a synthetic boot-check to CI that loads the runtime ConfigMap + Secret into a dry-run instance of the gateway and asserts boot success. This catches missing-config bugs before they hit production.
Cause 2: add a pg-gateway readiness gate in front of the gateway deployment. If pg-gateway isn't healthy, don't start a new gateway replica.
Cause 3: tighten the rawBody size limit on the gateway's webhook handler. Partners should never send a callback > 10 KB; reject larger.
Cause 5: codify the required network policy in infra/k8s/ so it's reviewable.
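
Two of the follow-ups above reduce to simple checks, sketched here under stated assumptions. The first is a lighter stand-in for the dry-run boot check: it only verifies that the three variables named in this runbook are present, and the real required list may be longer. The second expresses the 10 KB callback limit, assuming 10 KB means 10 × 1024 bytes. The gateway itself is a Go service, so these Python functions (missing_config, callback_within_limit) are hypothetical and illustrate the logic only.

```python
from typing import List, Mapping

# Variables named in this runbook; assumed, not an exhaustive list.
REQUIRED_VARS = (
    "GATEWAY_DASHEN_SIGNING_KEY",
    "GATEWAY_DATABASE_URL",
    "GATEWAY_DASHEN_BASE_URL",
)

# Assumption: "10 KB" is read as 10 * 1024 bytes.
MAX_CALLBACK_BYTES = 10 * 1024


def missing_config(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are absent or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def callback_within_limit(raw_body: bytes) -> bool:
    """True if a webhook body is small enough to be accepted."""
    return len(raw_body) <= MAX_CALLBACK_BYTES
```

In CI, a non-empty result from missing_config applied to the rendered ConfigMap + Secret would fail the pipeline before the rollout. A presence check like this does not catch a wrong GATEWAY_DATABASE_URL or an invalid base URL, which is why the dry-run boot remains the stronger test.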

Escalation

30 minutes without diagnosis → page the platform on-call.
pg-gateway is irrecoverable (data loss / corruption suspected) → page DB on-call AND finance/ops. We will need to manually reconcile against partner statements before we trust anything.
Outage > 1 hour with active disbursements stalled → notify customer success; they may need to communicate downstream.

Related

The gateway's settle-side behaviour after recovery is in services/integration-gateway/internal/server/get_status.go and webhook/handler.go.
API-side resilience around gateway calls: apps/api/src/products/ewa/integration-gateway.grpc-client.ts.
The startup-checks list lives in apps/api/src/_infra/health/startup-checks.service.ts — confirm the gateway probe is wired before the next on-call rota.
