Production Deployment

Running Nantian Gateway in production means hardening the defaults against real-world conditions: traffic spikes, node failures, noisy neighbors, and the occasional misconfiguration that takes down a pod. This guide walks through the checklist you should complete before exposing the gateway to live traffic.

Put the settings from this guide into a custom values file (e.g. my-production.yaml) and layer it on top of the default chart values.

Work through these items in order. Each one addresses a specific failure mode that the default development-oriented settings don’t protect against.

Default resource requests and limits are tuned for development. Production workloads need headroom.

Component      Request CPU  Request Memory  Limit CPU  Limit Memory
Control plane  200m         256Mi           1          1Gi
Data plane     2            512Mi           (none)     2Gi

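The data plane carries the traffic load, so its CPU request should reflect the expected throughput. It has no CPU limit, which lets it burst during traffic spikes instead of being throttled. Memory limits are especially important for the data plane: Rust has no garbage collector, but the allocator can hold onto freed memory instead of returning it to the operating system right away, so resident memory may stay high after a spike.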

Set these in your values file:

controlplane:
  resources:
    requests:
      cpu: "200m"
      memory: "256Mi"
    limits:
      cpu: "1"
      memory: "1Gi"
dataplane:
  resources:
    requests:
      cpu: "2"
      memory: "512Mi"
    limits:
      memory: "2Gi"
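To check whether these requests match real usage, you can compare them against live consumption once the gateway is running. The helper below is a minimal sketch: it assumes metrics-server is installed, that the release lives in the nantian-gw namespace used by the verification commands in this guide, and that the control plane carries a matching app.kubernetes.io/component=controlplane label (only the dataplane value appears in this guide). The function name is hypothetical.

```shell
# Show live CPU/memory usage per container for one gateway component.
# Assumes metrics-server is available; "kubectl top" fails without it.
# $1 is the component label value, e.g. "dataplane".
component_usage() {
  kubectl top pod -n nantian-gw \
    -l "app.kubernetes.io/component=$1" --containers
}
```

Run component_usage dataplane under representative load and raise the CPU request if usage regularly sits close to it.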

Run at least two control plane replicas and three data plane replicas. The control plane uses leader election, so only one replica is active at a time. The standby handles failover. Data plane replicas are all active and share traffic.

controlplane:
  replicas: 2
dataplane:
  replicas: 3

Without anti-affinity, the scheduler might colocate all your data plane pods on the same node. A node failure then takes down your entire data plane.

dataplane:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: dataplane
          topologyKey: kubernetes.io/hostname

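This ensures no two data plane pods land on the same node. Because the rule is a hard requirement, you need at least as many schedulable nodes as data plane replicas; otherwise the extra pods stay Pending. If you have enough nodes, also spread across zones: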

dataplane:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: dataplane
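Both placement rules can be checked from the command line. The two helpers below are a sketch, not part of the chart: the function names are hypothetical, and the nantian-gw namespace is taken from the verification commands later in this guide.

```shell
# Print every node that hosts more than one data plane pod.
# No output means the anti-affinity rule is being honored.
colocated_nodes() {
  kubectl get pods -n nantian-gw -l app.kubernetes.io/component=dataplane \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
    | sort | uniq -d
}

# Count the distinct zones found in node labels.
# Zone spreading only helps when this is greater than 1.
zone_count() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' \
    | sort -u | grep -c .
}
```

If zone_count prints 0, the nodes do not carry the topology.kubernetes.io/zone label, so check your node labels before enabling the spread constraint.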

In production, every connection should be encrypted. The gateway supports three TLS layers, only the first of which is configured through the Helm chart:

gRPC TLS between data plane and control plane. The control plane serves a TLS certificate and the data plane validates it. Optionally, enable mutual TLS so the control plane also authenticates the data plane.

controlplane:
  grpcTLS:
    enabled: true
    existingSecret: "nantian-grpc-tls"
    requireClientCert: true
dataplane:
  xdsTLS:
    enabled: true
    domainName: "nantian-controlplane-grpc.nantian-gw.svc.cluster.local"

The secret must contain tls.crt, tls.key, and optionally ca.crt. For mTLS, the data plane also needs a client certificate configured via its xDS client settings.

Downstream TLS for traffic entering the gateway. This is configured through Gateway API listeners, not the Helm chart. See the TLS configuration guide for details.

Upstream TLS to backend services. Configured per-route through BackendTLSPolicy resources.
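For the gRPC TLS layer, the secret named in existingSecret has to exist before the control plane starts. The sketch below shows one way to create it with the three keys listed above. The function name and file paths are hypothetical, and the kubernetes.io/tls secret type is an assumption; the chart may accept other secret types.

```shell
# Create the nantian-grpc-tls secret with tls.crt, tls.key and, optionally, ca.crt.
# $1 = server certificate, $2 = private key, $3 = CA bundle (optional).
create_grpc_tls_secret() {
  cert="$1"
  key="$2"
  ca="${3:-}"
  # Rebuild the argument list as --from-file flags, adding ca.crt only if given.
  set -- --from-file=tls.crt="$cert" --from-file=tls.key="$key"
  if [ -n "$ca" ]; then
    set -- "$@" --from-file=ca.crt="$ca"
  fi
  kubectl create secret generic nantian-grpc-tls \
    --namespace nantian-gw --type=kubernetes.io/tls "$@"
}
```

A call such as create_grpc_tls_secret server.crt server.key ca.crt uses kubectl create secret generic rather than kubectl create secret tls, because the tls subcommand only accepts a certificate and a key and has no flag for ca.crt.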

Lock down the container runtime:

controlplane:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 65532
    runAsGroup: 65532
    fsGroup: 65532
  containerSecurityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
dataplane:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 65532
    runAsGroup: 65532
    fsGroup: 65532
  containerSecurityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: [ALL]
      add: [NET_BIND_SERVICE]

The data plane needs NET_BIND_SERVICE to bind to privileged ports (below 1024). All other capabilities are dropped. The control plane binds to high ports (18080+) so it doesn’t need any special capabilities.

Switch to structured JSON logging in production. Your log aggregation system will parse it more reliably than text formats.

dataplane:
  config:
    log:
      level: "info,nantian_core::connectors=off"
      format: "json"
      nonBlocking: true
      nonBlockingBufferedLines: 65536
      dropWhenFull: true

nonBlocking: true keeps the data plane from stalling on slow log writes. dropWhenFull: true discards log lines rather than blocking under extreme load.

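Enable access logging with a format that captures the fields your monitoring needs. Because the container security context sets readOnlyRootFilesystem: true, make sure the log path sits on a writable volume: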

dataplane:
  config:
    accessLog:
      enabled: true
      path: "/var/log/nantian-gw/access.log"
      mode: "text"

For metrics, enable the ServiceMonitor resource if you’re using the Prometheus operator:

monitoring:
  serviceMonitor:
    enabled: true
    interval: "30s"

Pod disruption budgets (PDBs) prevent voluntary disruptions (node drains, cluster autoscaler scale-downs) from taking down too many replicas at once.

controlplane:
  pdb:
    minAvailable: 1
dataplane:
  pdb:
    minAvailable: 2

With three data plane replicas and minAvailable: 2, the cluster can drain one node at a time without dropping below the minimum.
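The arithmetic behind that statement is simply replicas minus minAvailable, assuming every replica is healthy. The sketch below spells it out and adds a query for the live value the cluster reports. Both function names are hypothetical, and the namespace is the one used by the verification commands in this guide.

```shell
# Voluntary disruptions a PDB permits when all replicas are healthy.
# $1 = replica count, $2 = minAvailable.
pdb_headroom() {
  echo $(( $1 - $2 ))
}

# List each PDB with the number of disruptions the cluster currently allows.
pdb_allowed() {
  kubectl get pdb -n nantian-gw \
    -o custom-columns=NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed
}
```

pdb_headroom 3 2 prints 1 for the data plane, and pdb_headroom 2 1 prints 1 for the control plane. If pdb_allowed shows 0 for a budget, a node drain will block until an unhealthy replica recovers.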

If you run dedicated infrastructure nodes, pin the gateway components to them:

controlplane:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  tolerations:
    - key: "node-role.kubernetes.io/infra"
      operator: "Exists"
      effect: "NoSchedule"
dataplane:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  tolerations:
    - key: "node-role.kubernetes.io/infra"
      operator: "Exists"
      effect: "NoSchedule"
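These selectors and tolerations only take effect if the nodes carry the matching label and taint. If your infrastructure nodes are not prepared yet, something like the following sketch would do it. The function name is hypothetical, and many clusters apply these settings through their node provisioning tooling instead.

```shell
# Label and taint one node so it matches the nodeSelector and toleration above.
# $1 = node name.
mark_infra_node() {
  # An empty label value matches the nodeSelector value "".
  kubectl label node "$1" node-role.kubernetes.io/infra="" --overwrite
  # The taint keeps pods without the toleration off this node.
  kubectl taint node "$1" node-role.kubernetes.io/infra:NoSchedule --overwrite
}
```

Afterwards, kubectl get nodes -l node-role.kubernetes.io/infra lists the nodes the gateway is allowed to schedule onto.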

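Production deployments should pull images from a private registry you control. Pin each image to a specific release tag rather than a floating tag such as latest, so that rollouts and rollbacks are reproducible:

global:
  imagePullSecrets:
    - my-registry-secret
controlplane:
  image:
    registry: "registry.example.com"
    repository: "nantian-gw/nantian-controlplane"
    tag: "latest"  # replace with the release tag you have tested
dataplane:
  image:
    registry: "registry.example.com"
    repository: "nantian-gw/dataplane"
    tag: "latest"  # replace with the release tag you have tested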
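The pull secret referenced under imagePullSecrets must exist in the release namespace before the pods start. A minimal sketch, assuming the registry uses username and password authentication (the function name is hypothetical, and the namespace is the one used by the verification commands in this guide):

```shell
# Create the image pull secret referenced by global.imagePullSecrets.
# $1 = registry username, $2 = registry password or token.
create_pull_secret() {
  kubectl create secret docker-registry my-registry-secret \
    --namespace nantian-gw \
    --docker-server=registry.example.com \
    --docker-username="$1" \
    --docker-password="$2"
}
```

Prefer passing a short-lived token read from your secret manager over typing a password on the command line, where it may end up in shell history.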

Once you’ve assembled your values file:

helm install nantian-gw nantian-gw/nantian-gw \
  --namespace nantian-gw --create-namespace \
  -f my-production.yaml

Helm starts from the chart's default values and then applies each -f file in the order given, with later files overriding earlier ones.
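Before installing, it can help to render the chart locally and inspect the merged result. The wrapper below is a sketch around helm template, which renders manifests without contacting the cluster. The function name is hypothetical, and the release, chart, and file names are the ones used in this guide.

```shell
# Render the chart with the production values without touching the cluster.
render_production() {
  helm template nantian-gw nantian-gw/nantian-gw \
    --namespace nantian-gw -f my-production.yaml
}
```

For example, render_production | grep 'replicas:' shows whether the replica counts from your values file made it into the rendered Deployments. After installation, helm get values nantian-gw -n nantian-gw prints the overrides that were actually applied.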

After the install completes, verify each checklist item:

# Check replicas are running
kubectl get pods -n nantian-gw
# Verify anti-affinity (each pod on a different node)
kubectl get pods -n nantian-gw -o wide -l app.kubernetes.io/component=dataplane
# Confirm TLS is active (look for TLS handshake in control plane logs)
kubectl logs -n nantian-gw deployment/nantian-gw-controlplane | grep -i tls
# Check PDBs
kubectl get pdb -n nantian-gw
# Verify security contexts
kubectl get pods -n nantian-gw -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].securityContext}{"\n"}{end}'

If anything doesn’t match, fix the values and upgrade:

helm upgrade nantian-gw nantian-gw/nantian-gw -n nantian-gw -f my-production.yaml