Runbook: integration-gateway down
Service affected: services/integration-gateway (Go gRPC + HTTP webhook listener)
Owner: integration team
Severity: HIGH — disbursement attempts from the API will fail; existing in-flight disbursements still settle via the bank's async webhook (which terminates at this same service), so a long outage stalls every pending settlement.
Symptom
Any one of:
- /readyz on the API reports ledger: up, gateway: down (or skipped if GATEWAY_GRPC_ADDR is unset).
- Alert demozpay_dependency_up{dependency="integration-gateway"} == 0 for > 1 minute.
- A spike in demozpay_ledger_grpc_duration_seconds{rpc="PostTransaction"} failures with the error message including connection refused or name resolution failed.
- User-reported: POST /api/ewa/requests/:id/disburse returns 502 / 504 / Internal.
- kubectl get pods -n demoz shows gateway-* in CrashLoopBackOff, OOMKilled, or Terminated (1).
Likely causes
- The gateway crashed at boot. This is almost always a config error: missing GATEWAY_DASHEN_SIGNING_KEY, wrong GATEWAY_DATABASE_URL, or partner-adapter init failure (e.g. invalid GATEWAY_DASHEN_BASE_URL).
- pg-gateway is unreachable. pgx pool init fails fast with a clear error; the gateway exits before serving.
- Out of memory. Extremely unlikely for the gateway (its working set is tiny) but can happen if a runaway webhook payload was POSTed. The OOM-killer leaves a Kubernetes event.
- Image pull failure during a roll-out: ImagePullBackOff.
- Network policy / service mesh blocking the gateway's ports, typically after a Calico / Istio config change.
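The config failure mode (cause 1) can be illustrated with a small boot-time validation sketch. It is not the gateway's actual loader: the list of required variables contains only the three named in this runbook, and the function name validateConfig is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// requiredVars holds only the variables this runbook names as boot-critical.
// The real gateway may require more; treat the list as an assumption.
var requiredVars = []string{
	"GATEWAY_DASHEN_SIGNING_KEY",
	"GATEWAY_DATABASE_URL",
	"GATEWAY_DASHEN_BASE_URL",
}

// validateConfig returns one error per problem found. The lookup parameter
// has the same shape as os.LookupEnv so the check can run against a fake
// environment in CI.
func validateConfig(lookup func(string) (string, bool)) []error {
	var errs []error
	for _, name := range requiredVars {
		if v, ok := lookup(name); !ok || v == "" {
			errs = append(errs, fmt.Errorf("%s is required", name))
		}
	}
	// A present base URL must be an absolute http(s) URL with a host.
	if raw, ok := lookup("GATEWAY_DASHEN_BASE_URL"); ok && raw != "" {
		u, err := url.Parse(raw)
		if err != nil || (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
			errs = append(errs, fmt.Errorf("GATEWAY_DASHEN_BASE_URL is not a valid http(s) URL: %q", raw))
		}
	}
	return errs
}

func main() {
	errs := validateConfig(os.LookupEnv)
	for _, err := range errs {
		fmt.Fprintln(os.Stderr, "config load failed:", err)
	}
	if len(errs) > 0 {
		os.Exit(1) // fail fast, before any listener is opened
	}
	fmt.Println("config ok")
}
```

Reporting every problem in one pass, rather than stopping at the first, saves a redeploy cycle when several variables are wrong at once.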
Diagnosis steps
- Capture the boot logs. kubectl logs -n demoz deploy/gateway --tail=200. If the container has already restarted, add --previous to read the logs of the crashed container. The Go service emits JSON logs; the last line before exit identifies the cause for causes 1–3. Examples:
  - "config load failed", "err": "GATEWAY_DASHEN_SIGNING_KEY is required" → cause 1.
  - "pg pool init failed", "err": "..." → cause 2.
  - No log at all, exit code 137 + Kubernetes event Reason: OOMKilled → cause 3.
- Pod state. kubectl describe pod -n demoz <pod> shows recent events; ImagePullBackOff jumps out for cause 4. Check the image tag against the release manifest.
- Network reachability. From an API pod: nc -zv gateway 50052. If you get connection refused from inside the cluster but the pod is Running, the gateway hasn't bound the port yet and is usually still booting. If you get no route to host, that's a network policy issue (cause 5).
- pg-gateway state. kubectl get pod -n demoz -l app=pg-gateway → if not Running + Healthy, the gateway can't come up. Diagnose Postgres first.
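The log-triage step for causes 1–3 can be sketched as a small helper that reads the captured log on stdin and takes the container exit code as its first argument. This is a sketch under assumptions: it matches the message strings quoted above as raw substrings (the runbook does not name the JSON key that carries the message), reads only the "err" field shown in the examples, and the names classify and errField are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// errField extracts the "err" value from a JSON log line, or "" if the
// line is not JSON or has no string "err" field.
func errField(line string) string {
	var fields map[string]interface{}
	if json.Unmarshal([]byte(line), &fields) != nil {
		return ""
	}
	s, _ := fields["err"].(string)
	return s
}

// classify maps the last log line and the container exit code to one of
// the causes in this runbook.
func classify(lastLine string, exitCode int) string {
	switch {
	case strings.Contains(lastLine, "config load failed"):
		return "cause 1: config error (" + errField(lastLine) + ")"
	case strings.Contains(lastLine, "pg pool init failed"):
		return "cause 2: pg-gateway unreachable (" + errField(lastLine) + ")"
	case strings.TrimSpace(lastLine) == "" && exitCode == 137:
		return "cause 3: likely OOMKilled, confirm in the pod events"
	}
	return "unclassified: check pod events and network reachability"
}

func main() {
	exitCode := 0
	if len(os.Args) > 1 {
		exitCode, _ = strconv.Atoi(os.Args[1]) // a non-numeric argument is treated as 0
	}
	last := ""
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			last = line // keep only the final non-empty line
		}
	}
	fmt.Println(classify(last, exitCode))
}
```

Causes 4 and 5 leave no application log, so the helper deliberately falls through to "unclassified" and points back at the pod-state and reachability checks.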
Mitigation
- Config error (cause 1): roll back to the last-good ConfigMap / Secret and redeploy. Add the missing var; redeploy forward.
- pg-gateway unreachable (cause 2): bring pg-gateway back. If the DB itself is healthy, this is almost always a network or auth (password) issue. Check the Kubernetes Secret holding GATEWAY_DATABASE_URL.
- OOM (cause 3): bump the pod's memory limit by 1.5×, redeploy, and follow up to find the runaway payload. The gateway should never see a single request that large; the upstream API is doing something wrong.
- Image pull (cause 4): confirm the image tag exists in the registry; if a pull-secret expired, refresh it.
- Network policy (cause 5): temporarily widen the policy to allow api → gateway:50052 and <partner-public-CIDR> → gateway:50053; pin the change with the network owner before closing the ticket.
While the gateway is down
- In-flight disbursements (rows in disbursement.status = INITIATED / SUBMITTED / ACCEPTED) are NOT lost. The partner has them. When the gateway returns, its webhook will receive partner callbacks and settle them.
- API-side EWA / loan rows stuck in SUBMITTED_TO_BANK are equally safe. The settlement-poller will re-poll once the gateway returns.
- New disbursement attempts during the outage will fail-fast at the API: the DisburseEwaUseCase / DisburseLoanUseCase catches the gRPC error and leaves the row in APPROVED. The user-facing endpoint returns 500. Do not retry server-side automatically; let the user retry once the alert clears.
Resolution
- Cause 1 / 4: add a synthetic boot-check to CI that loads the runtime ConfigMap + Secret into a dry-run instance of the gateway and asserts boot success. This catches missing-config bugs before they hit production.
- Cause 2: add a pg-gateway readiness gate in front of the gateway deployment. If pg-gateway isn't healthy, don't start a new gateway replica.
- Cause 3: tighten the rawBody size limit on the gateway's webhook handler. Partners should never send a callback > 10 KB; reject larger.
- Cause 5: codify the required network policy in infra/k8s/ so it's reviewable.
Escalation
- 30 minutes without diagnosis → page the platform on-call.
- pg-gateway is irrecoverable (data loss / corruption suspected) → page DB on-call AND finance/ops. We will need to manually reconcile against partner statements before we trust anything.
- Outage > 1 hour with active disbursements stalled → notify customer success; they may need to communicate downstream.
Related
- The gateway's settle-side behaviour after recovery is in services/integration-gateway/internal/server/get_status.go and webhook/handler.go.
- API-side resilience around gateway calls: apps/api/src/products/ewa/integration-gateway.grpc-client.ts.
- The startup-checks list lives in apps/api/src/_infra/health/startup-checks.service.ts; confirm the gateway probe is wired before the next on-call rota.