Operations Overview
This chapter covers the day-to-day operational concerns of running Nantian Gateway in production. You’ll find reference material for metrics, dashboards, alerting rules, troubleshooting procedures, and backup and recovery.
Operational model
Nantian Gateway runs as a set of Kubernetes workloads. The control plane is a Deployment (typically two replicas with leader election), the data plane is a separate Deployment (scaled independently), and the dashboard is a single-replica Deployment. All three are managed through the Helm chart or Kustomize overlays.
The operational surface is intentionally small. There’s no database to manage, no external coordination service to maintain, and no persistent state outside of Kubernetes itself. The control plane stores node status in Kubernetes Lease objects. Everything else is in-memory and rebuilt from the Kubernetes API on restart.
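Because node status lives in Lease objects, listing the Leases in the gateway namespace is one way to see what the control plane has persisted. The following is a minimal sketch, assuming the `nantian-gw` namespace used by the restart commands in this chapter. `KUBECTL` is a hypothetical override variable, not part of Nantian Gateway, that makes the function easy to dry-run.

```shell
#!/usr/bin/env bash
# Sketch only: list the Lease objects in the gateway namespace.
# Assumptions: the namespace is nantian-gw; KUBECTL is a hypothetical
# override for the kubectl binary (for example KUBECTL=echo for a dry run).
list_gateway_leases() {
  local namespace="${1:-nantian-gw}"
  # Leases are namespaced resources in the coordination.k8s.io API group.
  "${KUBECTL:-kubectl}" get leases --namespace "$namespace"
}
```

Call `list_gateway_leases` against a live cluster, or `KUBECTL=echo list_gateway_leases` to print the arguments without contacting the API server.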
Monitoring stack
Nantian Gateway ships with a pre-built observability stack:
| Component | Location | Purpose |
|---|---|---|
| Prometheus metrics | Control plane :18082, data plane admin API | Scrape targets for time-series metrics |
| Grafana dashboard | deploy/observability/grafana/ | Pre-built dashboard with executive overview, control plane panels, and data plane panels |
| Prometheus rules | deploy/observability/prometheus/ | Recording rules for data plane metrics, in both native Prometheus and Prometheus Operator formats |
| Structured logs | stdout (JSON format) | Log aggregation with Loki, Elasticsearch, or Datadog |
| OpenTelemetry traces | Configurable | Distributed tracing for request flows |
Metrics architecture
    +------------------+   scrape    +------------------+
    |  Control Plane   | <---------- |    Prometheus    |
    |  :18082/metrics  |             |                  |
    +------------------+             +--------+---------+
                                              |
    +------------------+   scrape    +--------v---------+
    |    Data Plane    | <---------- |     Grafana      |
    |  admin/metrics   |             |    Dashboard     |
    +------------------+             +------------------+

The control plane exposes metrics on :18082 by default. The data plane exposes metrics through its admin API, which can be scraped directly or aggregated through the control plane’s data plane aggregation feature.
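To see what a scrape target actually exposes before wiring it into Prometheus, it can help to list the metric families it reports. The sketch below reads the `# TYPE <name> <type>` comment lines of the Prometheus text exposition format. It assumes the endpoint emits those lines, as Prometheus client libraries typically do, and the function name `metric_families` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: print "<metric family><TAB><type>" for every family found in
# Prometheus text-format input on stdin.
metric_families() {
  # TYPE comments look like: "# TYPE some_metric_total counter"
  awk '$1 == "#" && $2 == "TYPE" { print $3 "\t" $4 }'
}
```

For example, with port 18082 reachable on localhost, `curl -s http://localhost:18082/metrics | metric_families` prints one line per metric family.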
Logging
Both planes emit structured JSON logs to stdout. In Kubernetes, these are captured by the container runtime and can be collected by any log aggregation system that supports the container log format.
Control plane log format:
    {"time":"2026-06-06T12:00:00Z","level":"INFO","msg":"snapshot published","component":"controlplane","version":"abc123"}

Data plane log format:

    {"timestamp":"2026-06-06T12:00:00.000Z","level":"INFO","target":"nantian_core::proxy","message":"configuration applied","version":"abc123"}

Set log.level to debug for troubleshooting. Debug logs include reconciliation details, snapshot diffs, gRPC stream events, and proxy-level request information. Debug logging generates significantly more output and should not be left enabled in production.
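Both formats share a `level` field, so a single filter can narrow logs from either plane. This is a rough sketch that matches the field textually. It assumes compact JSON with no space after the colon, as in the samples above, and that levels other than `INFO` (for example `ERROR`) follow the same upper-case spelling, which the samples do not confirm. For anything more robust, parse the JSON with a tool such as jq.

```shell
#!/usr/bin/env bash
# Sketch only: keep the log lines whose "level" field equals the given value.
# Assumption: compact JSON output, so the field appears as "level":"VALUE".
filter_level() {
  local level="$1"
  # -F treats the pattern as a fixed string rather than a regular expression.
  grep -F "\"level\":\"${level}\""
}
```

A possible use is `kubectl logs deployment/nantian-gw-dataplane -n nantian-gw | filter_level ERROR`, with the deployment and namespace names taken from the restart commands in this chapter.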
Health checks
The control plane exposes Kubernetes-compatible health probes:
| Probe | Endpoint | Port | Description |
|---|---|---|---|
| Liveness | GET /livez | :18083 | Returns 200 if the process is alive |
| Readiness | GET /readyz | :18083 | Returns 200 when the startup gate is open |
The readiness probe is gated by the lifecycle supervisor’s startup gate. The gate remains closed until all components (manager, admin server, metrics server, gRPC server) have started successfully. This prevents Kubernetes from routing traffic to a control plane that hasn’t finished initializing.
The Admin API also provides /livez and /readyz endpoints on the admin port (:18081). These are used by the dashboard and external monitoring tools.
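The probes can also be exercised by hand, which is useful when a pod is stuck in a not-ready state. The sketch below treats HTTP 200 as passing, matching the table above. It assumes port 18083 is reachable on localhost (for example through a port-forward); the function names are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: probe a health endpoint and report pass or fail.
# Assumption: the probe port is reachable from where this runs.

# A probe passes only on HTTP 200.
probe_ok() {
  [ "$1" = "200" ]
}

check_probe() {
  local url="$1" code
  # -w '%{http_code}' prints only the status code; curl reports 000 when it
  # cannot connect at all.
  code="$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$url" || true)"
  if probe_ok "$code"; then
    echo "ok   $url"
  else
    echo "FAIL $url (HTTP $code)"
    return 1
  fi
}
```

For example: `check_probe http://localhost:18083/livez && check_probe http://localhost:18083/readyz`. A passing liveness check with a failing readiness check suggests the startup gate is still closed.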
Common operational tasks
Checking gateway health
    # Check control plane status via Admin API
    curl -s http://localhost:18081/v1/summary | jq .

    # Check connected data plane nodes
    curl -s http://localhost:18081/v1/nodes | jq .

    # Check Prometheus metrics
    curl -s http://localhost:18082/metrics | grep nantian_gateway

Viewing the current configuration
    # Full IR snapshot
    curl -s http://localhost:18081/v1/snapshot | jq .

    # Listeners and attached routes
    curl -s http://localhost:18081/v1/listeners | jq .

    # Routes by kind
    curl -s "http://localhost:18081/v1/routes?kind=HTTPRoute" | jq .

Restarting a data plane
Data planes are stateless. You can restart them at any time without losing configuration:
    kubectl rollout restart deployment/nantian-gw-dataplane -n nantian-gw

The restarted data plane will reconnect to the control plane, receive the current snapshot, and resume serving traffic. Existing connections to other data plane replicas are unaffected.
Restarting the control plane
The control plane is also stateless. Restarting the leader triggers a leader election. The standby replica takes over within the lease duration (default 15 seconds):
    kubectl rollout restart deployment/nantian-gw-controlplane -n nantian-gw

During the transition, the data plane continues serving traffic with the last received configuration. No traffic interruption occurs.
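`kubectl rollout restart` returns as soon as the restart is requested, so scripts usually follow it with `kubectl rollout status` to wait for the new pods. The sketch below combines the two for either Deployment. The 120-second timeout is an arbitrary choice, and `KUBECTL` is a hypothetical override variable used here only to allow a dry run.

```shell
#!/usr/bin/env bash
# Sketch only: restart a gateway Deployment and wait for the rollout to finish.
# Assumptions: namespace nantian-gw by default; KUBECTL is a hypothetical
# override for the kubectl binary; the timeout default is arbitrary.
restart_and_wait() {
  local deployment="$1"
  local namespace="${2:-nantian-gw}"
  local timeout="${3:-120s}"
  local kubectl="${KUBECTL:-kubectl}"

  # Only wait for the rollout if the restart request itself succeeded.
  "$kubectl" rollout restart "deployment/${deployment}" --namespace "$namespace" &&
    "$kubectl" rollout status "deployment/${deployment}" --namespace "$namespace" --timeout="$timeout"
}
```

For example, `restart_and_wait nantian-gw-controlplane` restarts the control plane and returns non-zero if the rollout has not completed within the timeout.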
Resource requirements
Recommended resource allocations for production workloads:
| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Control plane | 100m | 500m | 128Mi | 512Mi |
| Data plane | 500m | 2 | 256Mi | 1Gi |
| Dashboard | 50m | 200m | 64Mi | 256Mi |
The data plane’s resource usage scales with traffic volume. More connections, higher throughput, and more complex routing rules all increase CPU and memory consumption. Monitor the Grafana dashboard’s resource panels and adjust limits accordingly.
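When experimenting with limits, `kubectl set resources` can apply values directly to a Deployment. The sketch below uses the data plane row from the table above. Treat it as a temporary measure: a change made this way is likely to be overwritten the next time the Helm chart or Kustomize overlay is applied, so lasting changes belong there. `KUBECTL` is a hypothetical override variable for dry runs.

```shell
#!/usr/bin/env bash
# Sketch only: apply the recommended data plane requests and limits in place.
# Assumptions: Deployment nantian-gw-dataplane in namespace nantian-gw;
# KUBECTL is a hypothetical override for the kubectl binary.
apply_dataplane_resources() {
  local namespace="${1:-nantian-gw}"
  # Without a container selector, the values apply to every container in the pod.
  "${KUBECTL:-kubectl}" set resources deployment nantian-gw-dataplane \
    --namespace "$namespace" \
    --requests=cpu=500m,memory=256Mi \
    --limits=cpu=2,memory=1Gi
}
```

Running `KUBECTL=echo apply_dataplane_resources` prints the command arguments without changing the cluster.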
Chapter structure
| Page | Covers |
|---|---|
| Metrics Reference | Complete catalog of Prometheus metrics emitted by both planes |
| Grafana Dashboard | How to import and customize the bundled Grafana dashboard |
| Alerting Rules | Recommended Prometheus alerting rules for production |
| Troubleshooting | Common issues, diagnostic commands, and resolution steps |
| Backup & Recovery | What to back up and how to recover from failures |
What’s next
- Metrics Reference: every metric, label, and what it means
- Grafana Dashboard: set up the pre-built dashboard
- Alerting Rules: alerts you should configure