Production Deployment

Running Nantian Gateway in production means hardening the defaults against real-world conditions: traffic spikes, node failures, noisy neighbors, and the occasional misconfiguration that takes down a pod. This guide walks through the checklist you should complete before exposing the gateway to live traffic.

Put the settings from this guide into a custom values file (e.g. my-production.yaml) and layer it on top of the default chart values.

Production checklist

Work through these items in order. Each one addresses a specific failure mode that the default development-oriented settings don’t protect against.

1. Resource limits

Default resource requests and limits are tuned for development. Production workloads need headroom.

Component	Request CPU	Request Memory	Limit CPU	Limit Memory
Control plane	200m	256Mi	1	1Gi
Data plane	2	512Mi	(none)	2Gi

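The data plane carries the traffic load, so size its CPU request for the expected throughput. No CPU limit is set for the data plane in the table above; leaving it unset avoids CPU throttling on the request path. Memory limits are especially important for the data plane: Rust has no garbage collector, but its allocator can hold on to freed memory instead of returning it to the operating system right away, so resident memory may stay high after a traffic spike.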

Set these in your values file:

controlplane:
  resources:
    requests:
      cpu: "200m"
      memory: "256Mi"
    limits:
      cpu: "1"
      memory: "1Gi"

dataplane:
  resources:
    requests:
      cpu: "2"
      memory: "512Mi"
    limits:
      memory: "2Gi"
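
Once the gateway is running, one way to judge whether the memory limit leaves enough headroom is to compare live usage against it. The sketch below is a hypothetical helper, not part of the chart. It assumes metrics-server is installed so that kubectl top works, that memory is reported in Mi, and that 80% of the limit is a reasonable warning threshold (an arbitrary choice).

```shell
# Hypothetical helper (not part of Nantian Gateway): print pods whose memory
# usage exceeds 80% of a limit given in Mi. Expects `kubectl top pods
# --no-headers` lines on stdin: NAME CPU MEMORY, with MEMORY like "1700Mi".
near_memory_limit() {
  awk -v limit_mi="$1" '
    {
      mem = $3
      sub(/Mi$/, "", mem)            # strip the unit suffix
      if (mem + 0 > limit_mi * 0.8)  # 80% threshold is an assumption
        print $1
    }'
}

# Usage against a live cluster (the 2Gi data plane limit is 2048Mi):
#   kubectl top pods -n nantian-gw --no-headers \
#     -l app.kubernetes.io/component=dataplane | near_memory_limit 2048
```

No output means every listed pod is below the threshold.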

2. Replica counts

Run at least two control plane replicas and three data plane replicas. The control plane uses leader election, so only one replica is active at a time. The standby handles failover. Data plane replicas are all active and share traffic.

controlplane:
  replicas: 2

dataplane:
  replicas: 3

3. Pod anti-affinity

Without anti-affinity, the scheduler might colocate all your data plane pods on the same node. A node failure then takes down your entire data plane.

dataplane:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: dataplane
          topologyKey: kubernetes.io/hostname

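This ensures no two data plane pods land on the same node. Because the rule is required rather than preferred, you need at least as many schedulable nodes as data plane replicas; otherwise the extra pods stay Pending. If you have enough nodes, also spread across zones: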

dataplane:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: dataplane
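
To confirm that the anti-affinity rule took effect, it can help to list which node each data plane pod landed on and flag any node that hosts more than one. The helper below is a hypothetical sketch, not something the chart ships; it only assumes the component label used in the examples above.

```shell
# Hypothetical helper: read "POD NODE" pairs on stdin and print every node
# that hosts more than one of the listed pods. No output means full spread.
colocated_nodes() {
  awk '{ count[$2]++ } END { for (n in count) if (count[n] > 1) print n }' | sort
}

# Usage against a live cluster:
#   kubectl get pods -n nantian-gw -l app.kubernetes.io/component=dataplane \
#     --no-headers -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName \
#     | colocated_nodes
```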

4. TLS everywhere

In production, every connection should be encrypted. The gateway has three TLS layers, and only the first is configured through the Helm chart:

gRPC TLS between data plane and control plane. The control plane serves a TLS certificate and the data plane validates it. Optionally, enable mutual TLS so the control plane also authenticates the data plane.

controlplane:
  grpcTLS:
    enabled: true
    existingSecret: "nantian-grpc-tls"
    requireClientCert: true

dataplane:
  xdsTLS:
    enabled: true
    domainName: "nantian-controlplane-grpc.nantian-gw.svc.cluster.local"

The secret must contain tls.crt, tls.key, and optionally ca.crt. For mTLS, the data plane also needs a client certificate configured via its xDS client settings.

Downstream TLS for traffic entering the gateway. This is configured through Gateway API listeners, not the Helm chart. See the TLS configuration guide for details.

Upstream TLS to backend services. Configured per-route through BackendTLSPolicy resources.
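
For the gRPC TLS secret described above, a quick check that the required keys are present can catch a malformed secret before the control plane fails to start. The helper below is a hypothetical sketch; it assumes the secret name nantian-grpc-tls from the example and treats ca.crt as optional, as the text states.

```shell
# Hypothetical helper: read secret key names on stdin (one per line) and print
# any required key that is missing. ca.crt is optional, so it is not checked.
missing_tls_keys() {
  keys=$(cat)
  for required in tls.crt tls.key; do
    printf '%s\n' "$keys" | grep -qxF "$required" || printf '%s\n' "$required"
  done
}

# Usage against a live cluster:
#   kubectl get secret nantian-grpc-tls -n nantian-gw \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
#     | missing_tls_keys
```

No output means both tls.crt and tls.key are present.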

5. Security contexts

Lock down the container runtime:

controlplane:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 65532
    runAsGroup: 65532
    fsGroup: 65532
  containerSecurityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true

dataplane:
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 65532
    runAsGroup: 65532
    fsGroup: 65532
  containerSecurityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: [ALL]
      add: [NET_BIND_SERVICE]

The data plane needs NET_BIND_SERVICE to bind to privileged ports (below 1024). All other capabilities are dropped. The control plane binds to high ports (18080+) so it doesn’t need any special capabilities.

6. Logging and observability

Switch to structured JSON logging in production. Your log aggregation system will parse it more reliably than text formats.

dataplane:
  config:
    log:
      level: "info,nantian_core::connectors=off"
      format: "json"
      nonBlocking: true
      nonBlockingBufferedLines: 65536
      dropWhenFull: true

nonBlocking: true keeps the data plane from stalling on slow log writes. dropWhenFull: true discards log lines rather than blocking under extreme load.

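Enable access logging with a format that captures the fields your monitoring needs. Because the security context in step 5 sets readOnlyRootFilesystem: true, the log path must sit on a writable volume mount: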

dataplane:
  config:
    accessLog:
      enabled: true
      path: "/var/log/nantian-gw/access.log"
      mode: "text"

For metrics, enable the ServiceMonitor resource if you’re using the Prometheus operator:

monitoring:
  serviceMonitor:
    enabled: true
    interval: "30s"

7. PodDisruptionBudget

PDBs prevent voluntary disruptions (node drains, cluster autoscaler) from taking down too many replicas at once.

controlplane:
  pdb:
    minAvailable: 1

dataplane:
  pdb:
    minAvailable: 2

With three data plane replicas and minAvailable: 2, the cluster can drain one node at a time without dropping below the minimum.
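
The arithmetic behind that statement can be sketched as a small function. This is an illustrative helper with hypothetical names, and it assumes every replica is healthy and that minAvailable is an integer rather than a percentage.

```shell
# Hypothetical helper: how many pods a PodDisruptionBudget lets voluntary
# disruptions evict at once, given replicas and an integer minAvailable.
allowed_disruptions() {
  replicas=$1
  min_available=$2
  allowed=$((replicas - min_available))
  if [ "$allowed" -lt 0 ]; then
    allowed=0   # a budget can never allow a negative number of evictions
  fi
  echo "$allowed"
}

# allowed_disruptions 3 2   # data plane values from this guide
# allowed_disruptions 2 1   # control plane values from this guide
```

A result of 0 means node drains will block on that component, which is what happens if minAvailable is set equal to the replica count.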

8. Node selection and tolerations

If you run dedicated infrastructure nodes, pin the gateway components to them:

controlplane:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  tolerations:
    - key: "node-role.kubernetes.io/infra"
      operator: "Exists"
      effect: "NoSchedule"

dataplane:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  tolerations:
    - key: "node-role.kubernetes.io/infra"
      operator: "Exists"
      effect: "NoSchedule"

9. Private registry

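Production deployments should pull images from a private registry you control. Replace latest in the example below with a specific version tag or digest, so that restarts and rollbacks always pull the same image: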

global:
  imagePullSecrets:
    - my-registry-secret

controlplane:
  image:
    registry: "registry.example.com"
    repository: "nantian-gw/nantian-controlplane"
    tag: "latest"

dataplane:
  image:
    registry: "registry.example.com"
    repository: "nantian-gw/dataplane"
    tag: "latest"
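
Because a mutable tag such as latest can resolve to a different image over time, it may be worth checking which running images are not pinned. The helper below is a hypothetical sketch, not part of the chart. It treats an image as pinned if it carries a digest or any tag other than latest.

```shell
# Hypothetical helper: read image references on stdin (space or newline
# separated) and print those that use the latest tag or no tag at all.
unpinned_images() {
  tr ' ' '\n' | awk '
    NF == 0          { next }         # skip blank lines
    $0 ~ "@sha256:"  { next }         # pinned by digest
    $0 ~ ":latest$"  { print; next }  # explicit latest tag
    $0 !~ ":[^/]+$"  { print }        # no tag, which defaults to latest
  '
}

# Usage against a live cluster:
#   kubectl get pods -n nantian-gw \
#     -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' \
#     | unpinned_images
```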

Deploying

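Once you’ve assembled your values file, install the chart. The verification commands later in this guide look in the nantian-gw namespace, so make that your current namespace or add --namespace nantian-gw to every helm command: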

helm install nantian-gw nantian-gw/nantian-gw \
  -f my-production.yaml

Helm merges my-production.yaml over the chart’s default values. If you pass several -f flags, the files are merged in order, with later files overriding earlier ones.

Post-deployment verification

After the install completes, verify each checklist item:

# Check replicas are running
kubectl get pods -n nantian-gw

# Verify anti-affinity (each pod on a different node)
kubectl get pods -n nantian-gw -o wide -l app.kubernetes.io/component=dataplane

# Confirm TLS is active (look for TLS handshake in control plane logs)
kubectl logs -n nantian-gw deployment/nantian-gw-controlplane | grep -i tls

# Check PDBs
kubectl get pdb -n nantian-gw

# Verify security contexts
kubectl get pods -n nantian-gw -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].securityContext}{"\n"}{end}'

If anything doesn’t match, fix the values and upgrade:

helm upgrade nantian-gw nantian-gw/nantian-gw -f my-production.yaml