
Operations Overview

This chapter covers the day-to-day operational concerns of running Nantian Gateway in production. You’ll find reference material for metrics, dashboards, alerting rules, troubleshooting procedures, and backup and recovery.

Nantian Gateway runs as a set of Kubernetes workloads. The control plane is a Deployment (typically two replicas with leader election), the data plane is a separate Deployment (scaled independently), and the dashboard is a single-replica Deployment. All three are managed through the Helm chart or Kustomize overlays.

The operational surface is intentionally small. There’s no database to manage, no external coordination service to maintain, and no persistent state outside of Kubernetes itself. The control plane stores node status in Kubernetes Lease objects. Everything else is in-memory and rebuilt from the Kubernetes API on restart.

Nantian Gateway ships with a pre-built observability stack:

| Component | Location | Purpose |
| --- | --- | --- |
| Prometheus metrics | Control plane :18082, Data plane admin | Scrape targets for time-series metrics |
| Grafana dashboard | deploy/observability/grafana/ | Pre-built dashboard with executive overview, control plane panels, and data plane panels |
| Prometheus rules | deploy/observability/prometheus/ | Recording rules for data plane metrics, both native and Prometheus Operator formats |
| Structured logs | stdout (JSON format) | Log aggregation with Loki, Elasticsearch, or Datadog |
| OpenTelemetry traces | Configurable | Distributed tracing for request flows |

+------------------+   scrape    +------------------+
|  Control Plane   | <---------- |    Prometheus    |
|  :18082/metrics  |             |                  |
+------------------+             +--------+---------+
                                          |
+------------------+   scrape    +--------v---------+
|    Data Plane    | <---------- |     Grafana      |
|  admin/metrics   |             |    Dashboard     |
+------------------+             +------------------+

The control plane exposes metrics on :18082 by default. The data plane exposes metrics through its admin API. Prometheus can scrape each data plane directly, or the metrics can be collected through the control plane’s data plane aggregation feature.
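When checking an endpoint by hand is not enough, the scraped text can be filtered in a script. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format and that gateway metrics share the `nantian_gateway` prefix used in the `grep` check on this page. The function name `filter_metrics` is hypothetical and is not part of Nantian Gateway.

```python
def filter_metrics(exposition_text, prefix="nantian_gateway"):
    """Return {series: value} for samples whose name starts with `prefix`.

    Handles the plain `name{labels} value [timestamp]` sample form of the
    Prometheus text format. Comment lines (# HELP, # TYPE) and blank lines
    are skipped; an optional trailing timestamp is ignored.
    """
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        if "{" in line and "}" in line:
            # Label values may contain spaces, so cut at the last closing brace.
            end = line.rfind("}")
            series, rest = line[: end + 1], line[end + 1:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if rest:
            samples[series] = float(rest[0])
    return samples
```

Pass it the response body of a GET against the metrics endpoint (for example the output of the `curl` command for :18082). Metrics without the prefix, such as runtime metrics, are dropped.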

Both planes emit structured JSON logs to stdout. In Kubernetes, these are captured by the container runtime and can be collected by any log aggregation system that supports the container log format.

Control plane log format:

{"time":"2026-06-06T12:00:00Z","level":"INFO","msg":"snapshot published","component":"controlplane","version":"abc123"}

Data plane log format:

{"timestamp":"2026-06-06T12:00:00.000Z","level":"INFO","target":"nantian_core::proxy","message":"configuration applied","version":"abc123"}

Set log.level to debug for troubleshooting. Debug logs include reconciliation details, snapshot diffs, gRPC stream events, and proxy-level request information. Debug logging generates significantly more output and should not be left enabled in production.
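The two planes name their timestamp and message fields differently (`time`/`msg` on the control plane, `timestamp`/`message` on the data plane), which can complicate queries that span both. A minimal sketch of a normalization step, assuming only the field names visible in the two sample lines above; `normalize_log_line` is a hypothetical helper, and most log pipelines can do the same renaming in their own configuration.

```python
import json

# Control plane field name -> data plane field name, taken from the two
# sample log lines above. All other fields are passed through unchanged.
_FIELD_ALIASES = {"time": "timestamp", "msg": "message"}


def normalize_log_line(raw_line):
    """Parse one JSON log line and rename control plane fields so that
    both planes use `timestamp` and `message`."""
    record = json.loads(raw_line)
    return {_FIELD_ALIASES.get(key, key): value for key, value in record.items()}
```

After normalization, a record from either plane can be filtered on the same `timestamp`, `level`, `message`, and `version` keys.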

The control plane exposes Kubernetes-compatible health probes:

| Probe | Endpoint | Port | Description |
| --- | --- | --- | --- |
| Liveness | GET /livez | :18083 | Returns 200 if the process is alive |
| Readiness | GET /readyz | :18083 | Returns 200 when the startup gate is open |

The readiness probe is gated by the lifecycle supervisor’s startup gate. The gate remains closed until all components (manager, admin server, metrics server, gRPC server) have started successfully. This prevents Kubernetes from routing traffic to a control plane that hasn’t finished initializing.
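The gating behavior can be illustrated with a small model. This is a sketch of the idea only, not the lifecycle supervisor's actual implementation: the `StartupGate` class is hypothetical, and the 503 status returned while the gate is closed is an assumption (this page only states that /readyz returns 200 once the gate is open).

```python
import threading


class StartupGate:
    """Readiness gate that opens once every named component has started."""

    def __init__(self, components):
        self._pending = set(components)
        self._lock = threading.Lock()
        self._opened = threading.Event()
        if not self._pending:
            self._opened.set()

    def mark_started(self, name):
        """Record that one component finished starting; open the gate
        when no components remain pending."""
        with self._lock:
            self._pending.discard(name)
            if not self._pending:
                self._opened.set()

    def is_open(self):
        return self._opened.is_set()

    def readyz_status(self):
        # 200 when open (per the probe table); 503 while closed is assumed.
        return 200 if self.is_open() else 503


gate = StartupGate(["manager", "admin server", "metrics server", "gRPC server"])
```

Each component calls `mark_started` with its own name when it is up. Until the last one does, `readyz_status()` keeps reporting not ready, so Kubernetes withholds traffic from the pod.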

The Admin API also provides /livez and /readyz endpoints on the admin port (:18081). These are used by the dashboard and external monitoring tools.

# Check control plane status via Admin API
curl -s http://localhost:18081/v1/summary | jq .
# Check connected data plane nodes
curl -s http://localhost:18081/v1/nodes | jq .
# Check Prometheus metrics
curl -s http://localhost:18082/metrics | grep nantian_gateway

# Full IR snapshot
curl -s http://localhost:18081/v1/snapshot | jq .
# Listeners and attached routes
curl -s http://localhost:18081/v1/listeners | jq .
# Routes by kind
curl -s "http://localhost:18081/v1/routes?kind=HTTPRoute" | jq .

Data planes are stateless. You can restart them at any time without losing configuration:

kubectl rollout restart deployment/nantian-gw-dataplane -n nantian-gw

The restarted data plane will reconnect to the control plane, receive the current snapshot, and resume serving traffic. Existing connections to other data plane replicas are unaffected.

The control plane is also stateless. Restarting the leader triggers a leader election. The standby replica takes over within the lease duration (default 15 seconds):

kubectl rollout restart deployment/nantian-gw-controlplane -n nantian-gw

During the transition, the data plane keeps serving traffic with the last configuration it received, so the failover does not interrupt traffic.
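After a restart, a script can confirm that a control plane pod reports ready before continuing. A minimal sketch, assuming the health port (:18083) is reachable on localhost, for example through a port-forward. The function names, the 60-second deadline, and the one-second polling interval are illustrative choices, not documented defaults.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a non-2xx status
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_ready(url="http://localhost:18083/readyz", deadline_s=60.0,
                     interval_s=1.0, get_status=http_status,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll the readiness endpoint until it returns 200 or the deadline passes.

    `get_status`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a running control plane.
    """
    end = clock() + deadline_s
    while True:
        if get_status(url) == 200:
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` with no arguments polls the default URL and returns `True` as soon as /readyz answers 200, or `False` once the deadline has passed.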

Recommended resource allocations for production workloads:

| Component | CPU Request | CPU Limit | Memory Request | Memory Limit |
| --- | --- | --- | --- | --- |
| Control plane | 100m | 500m | 128Mi | 512Mi |
| Data plane | 500m | 2 | 256Mi | 1Gi |
| Dashboard | 50m | 200m | 64Mi | 256Mi |

The data plane’s resource usage scales with traffic volume. More connections, higher throughput, and more complex routing rules all increase CPU and memory consumption. Monitor the Grafana dashboard’s resource panels and adjust limits accordingly.
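When adjusting limits, it can help to see how much headroom each recommendation leaves between request and limit. The sketch below uses the values from the table above and converts them to plain numbers. The helper names are hypothetical, and the parsers cover only the quantity forms that appear in that table (`m` for millicores, `Mi`/`Gi` for memory), not the full Kubernetes quantity syntax.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

# (request, limit) pairs copied from the recommended allocations table.
RECOMMENDED = {
    "control plane": {"cpu": ("100m", "500m"), "memory": ("128Mi", "512Mi")},
    "data plane": {"cpu": ("500m", "2"), "memory": ("256Mi", "1Gi")},
    "dashboard": {"cpu": ("50m", "200m"), "memory": ("64Mi", "256Mi")},
}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as '256Mi' or '1Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def headroom(component):
    """Return (cpu_ratio, memory_ratio) of limit to request."""
    cpu_req, cpu_lim = RECOMMENDED[component]["cpu"]
    mem_req, mem_lim = RECOMMENDED[component]["memory"]
    return (cpu_millicores(cpu_lim) / cpu_millicores(cpu_req),
            memory_bytes(mem_lim) / memory_bytes(mem_req))
```

For the data plane, both ratios are 4.0: the CPU limit of 2 cores is four times the 500m request, and the 1Gi memory limit is four times the 256Mi request.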

| Page | Covers |
| --- | --- |
| Metrics Reference | Complete catalog of Prometheus metrics emitted by both planes |
| Grafana Dashboard | How to import and customize the bundled Grafana dashboard |
| Alerting Rules | Recommended Prometheus alerting rules for production |
| Troubleshooting | Common issues, diagnostic commands, and resolution steps |
| Backup & Recovery | What to back up and how to recover from failures |