Production Deployment
Running Nantian Gateway in production means hardening the defaults against real-world conditions: traffic spikes, node failures, noisy neighbors, and the occasional misconfiguration that takes down a pod. This guide walks through the checklist you should complete before exposing the gateway to live traffic.
Put the settings from this guide into a custom values file (e.g. my-production.yaml) and layer it on top of the default chart values.
Production checklist
Work through these items in order. Each one addresses a specific failure mode that the default development-oriented settings don’t protect against.
1. Resource limits
Default resource requests and limits are tuned for development. Production workloads need headroom.
| Component | Request CPU | Request Memory | Limit CPU | Limit Memory |
|---|---|---|---|---|
| Control plane | 200m | 256Mi | 1 | 1Gi |
| Data plane | 2 | 512Mi | (none) | 2Gi |
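The data plane carries the traffic load, so its CPU request should reflect the expected throughput. Memory limits are especially important for the data plane: Rust has no garbage collector, but its allocator can hold on to freed memory instead of returning it to the operating system right away, so resident memory may stay high after a traffic spike.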
Set these in your values file:
    controlplane:
      resources:
        requests:
          cpu: "200m"
          memory: "256Mi"
        limits:
          cpu: "1"
          memory: "1Gi"

    dataplane:
      resources:
        requests:
          cpu: "2"
          memory: "512Mi"
        limits:
          memory: "2Gi"

2. Replica counts
Run at least two control plane replicas and three data plane replicas. The control plane uses leader election, so only one replica is active at a time. The standby handles failover. Data plane replicas are all active and share traffic.
    controlplane:
      replicas: 2

    dataplane:
      replicas: 3

3. Pod anti-affinity
Without anti-affinity, the scheduler might colocate all your data plane pods on the same node. A node failure then takes down your entire data plane.
    dataplane:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/component: dataplane
              topologyKey: kubernetes.io/hostname

This ensures no two data plane pods land on the same node. Because the rule is required rather than preferred, a replica stays Pending when every eligible node already runs a data plane pod, so you need at least as many schedulable nodes as replicas. If you have enough nodes, also spread across zones:

    dataplane:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: dataplane

4. TLS everywhere
In production, every connection should be encrypted. There are three TLS layers to configure, only the first of which is set through the Helm chart:
gRPC TLS between data plane and control plane. The control plane serves a TLS certificate and the data plane validates it. Optionally, enable mutual TLS so the control plane also authenticates the data plane.
    controlplane:
      grpcTLS:
        enabled: true
        existingSecret: "nantian-grpc-tls"
        requireClientCert: true

    dataplane:
      xdsTLS:
        enabled: true
        domainName: "nantian-controlplane-grpc.nantian-gw.svc.cluster.local"

The secret must contain tls.crt, tls.key, and optionally ca.crt. For mTLS, the data plane also needs a client certificate configured via its xDS client settings.
Downstream TLS for traffic entering the gateway. This is configured through Gateway API listeners, not the Helm chart. See the TLS configuration guide for details.
Upstream TLS to backend services. Configured per-route through BackendTLSPolicy resources.
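For the gRPC TLS layer, the secret named in existingSecret could be created from a manifest along these lines. This is a minimal sketch, not the chart's documented format: the secret type, the namespace, and the role of ca.crt as the bundle for verifying client certificates are assumptions, and the PEM bodies are placeholders.

```yaml
# Sketch of the secret referenced by controlplane.grpcTLS.existingSecret.
# Assumptions: secret type, namespace, and what ca.crt is used for.
apiVersion: v1
kind: Secret
metadata:
  name: nantian-grpc-tls
  namespace: nantian-gw
type: kubernetes.io/tls
stringData:
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    <server certificate; its SAN presumably needs to match dataplane.xdsTLS.domainName>
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    <server private key>
    -----END PRIVATE KEY-----
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    <optional CA bundle, assumed here to verify data plane client certificates>
    -----END CERTIFICATE-----
```

The three key names are the ones this guide lists. If a certificate manager issues the certificate for you, check that it writes the same keys into the secret.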
5. Security contexts
Lock down the container runtime:
    controlplane:
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 65532
        runAsGroup: 65532
        fsGroup: 65532
      containerSecurityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true

    dataplane:
      podSecurityContext:
        runAsNonRoot: true
        runAsUser: 65532
        runAsGroup: 65532
        fsGroup: 65532
      containerSecurityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: [ALL]
          add: [NET_BIND_SERVICE]

The data plane needs NET_BIND_SERVICE to bind to privileged ports (below 1024). All other capabilities are dropped. The control plane binds to high ports (18080+) so it doesn’t need any special capabilities.
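One consequence of readOnlyRootFilesystem: true is that any path the process writes to, such as the access log directory /var/log/nantian-gw used later in this guide, has to be backed by a mounted volume. The fragment below is a sketch of what that looks like at the pod level. It is plain Kubernetes pod spec rather than chart values, and the container and volume names are hypothetical.

```yaml
# Illustrative pod spec fragment, not Helm values for this chart.
# Assumption: container name "dataplane" and volume name "access-logs" are made up.
spec:
  containers:
    - name: dataplane
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: access-logs
          mountPath: /var/log/nantian-gw   # writable even though the root filesystem is read-only
  volumes:
    - name: access-logs
      emptyDir: {}                         # node-local scratch space, removed with the pod
```

This guide does not say whether the chart already provides such a mount, so inspect the rendered manifests before adding one yourself.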
6. Logging and observability
Switch to structured JSON logging in production. Your log aggregation system will parse it more reliably than text formats.
    dataplane:
      config:
        log:
          level: "info,nantian_core::connectors=off"
          format: "json"
          nonBlocking: true
          nonBlockingBufferedLines: 65536
          dropWhenFull: true

nonBlocking: true keeps the data plane from stalling on slow log writes. dropWhenFull: true discards log lines rather than blocking under extreme load.
Enable access logging with a format that captures the fields your monitoring needs:
    dataplane:
      config:
        accessLog:
          enabled: true
          path: "/var/log/nantian-gw/access.log"
          mode: "text"

For metrics, enable the ServiceMonitor resource if you’re using the Prometheus operator:

    monitoring:
      serviceMonitor:
        enabled: true
        interval: "30s"

7. PodDisruptionBudget
PDBs prevent voluntary disruptions (node drains, cluster autoscaler) from taking down too many replicas at once.
    controlplane:
      pdb:
        minAvailable: 1

    dataplane:
      pdb:
        minAvailable: 2

With three data plane replicas and minAvailable: 2, the cluster can drain one node at a time without dropping below the minimum.
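For reference, a hand-written PodDisruptionBudget equivalent to the data plane setting might look like the following. The resource name and namespace are assumptions, and the selector reuses the component label from the anti-affinity rule. The object the chart actually renders may differ.

```yaml
# Hand-written equivalent of dataplane.pdb.minAvailable: 2.
# Assumptions: metadata.name and namespace; the chart's real selector may carry more labels.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nantian-gw-dataplane
  namespace: nantian-gw
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: dataplane
```

With three healthy data plane pods this budget allows 3 - 2 = 1 voluntary eviction at a time, which is the number kubectl get pdb reports in its ALLOWED DISRUPTIONS column.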
8. Node selection and tolerations
If you run dedicated infrastructure nodes, pin the gateway components to them:
    controlplane:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
        - key: "node-role.kubernetes.io/infra"
          operator: "Exists"
          effect: "NoSchedule"

    dataplane:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
        - key: "node-role.kubernetes.io/infra"
          operator: "Exists"
          effect: "NoSchedule"

9. Private registry
Production deployments should pull images from a private registry you control:
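    global:
      imagePullSecrets:
        - my-registry-secret

    controlplane:
      image:
        registry: "registry.example.com"
        repository: "nantian-gw/nantian-controlplane"
        tag: "latest"

    dataplane:
      image:
        registry: "registry.example.com"
        repository: "nantian-gw/dataplane"
        tag: "latest"

In a real deployment, replace latest with the specific version you have tested. A mutable tag can resolve to different images on different nodes and makes rollbacks unreliable.

Deploying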
Once you’ve assembled your values file:
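    helm install nantian-gw nantian-gw/nantian-gw \
      --namespace nantian-gw --create-namespace \
      -f my-production.yaml

The namespace matches the one used in the xdsTLS domainName and in the verification commands. Helm layers the values file over the chart defaults. If you pass more than one -f file, Helm merges them in order, with later files overriding earlier ones.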
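As a sketch of that layering, suppose a second values file is passed with an additional -f flag after my-production.yaml. The file name my-production-eu.yaml and the numbers in it are hypothetical. The keys are the ones already used in this guide.

```yaml
# my-production-eu.yaml (hypothetical regional override, not shipped with the chart)
# Passed after my-production.yaml, so its values win wherever both files set the same key.
dataplane:
  replicas: 5          # overrides replicas: 3 from my-production.yaml
  pdb:
    minAvailable: 4    # keeps the budget in step with the higher replica count
```

Helm merges maps key by key, so dataplane.resources, dataplane.affinity, and everything else set only in the first file still apply. Lists such as tolerations are replaced wholesale rather than merged, so an override file has to repeat any list entries it wants to keep.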
Post-deployment verification
After the install completes, verify each checklist item:
    # Check replicas are running
    kubectl get pods -n nantian-gw

    # Verify anti-affinity (each pod on a different node)
    kubectl get pods -n nantian-gw -o wide -l app.kubernetes.io/component=dataplane

    # Confirm TLS is active (look for TLS handshake in control plane logs)
    kubectl logs -n nantian-gw deployment/nantian-gw-controlplane | grep -i tls

    # Check PDBs
    kubectl get pdb -n nantian-gw

    # Verify security contexts
    kubectl get pods -n nantian-gw -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].securityContext}{"\n"}{end}'

If anything doesn’t match, fix the values and upgrade:
    helm upgrade nantian-gw nantian-gw/nantian-gw \
      --namespace nantian-gw \
      -f my-production.yaml