Backup & Recovery

Nantian Gateway is designed to be largely stateless. The control plane rebuilds its entire internal state from the Kubernetes API on startup. The data plane receives its configuration from the control plane over gRPC/xDS and holds no persistent state of its own. This design simplifies backup and recovery significantly: you mostly need to back up your Kubernetes resources, not the gateway itself.

What needs backing up

Nantian Gateway’s operational state lives in Kubernetes resources. Here’s what matters:

Resource	Back up?	Why
Gateway, GatewayClass	Yes	Defines listeners, ports, TLS config
HTTPRoute, GRPCRoute, TCPRoute, UDPRoute, TLSRoute	Yes	All routing rules
AIService, TokenPolicy, WasmPlugin, BackendLBPolicy	Yes	Custom CRD configuration
BackendTLSPolicy	Yes	Backend TLS configuration
ReferenceGrant	Yes	Cross-namespace access grants
TLS Secrets	Yes	Certificates for TLS termination
Backend Secrets	Yes	Backend auth credentials
ConfigMaps (Helm values, config files)	Yes	Gateway configuration
Node status Leases	No	Auto-regenerated by data planes
Data plane pods	No	Stateless, redeployed from Deployment spec
Control plane pods	No	Stateless, redeployed from Deployment spec

What doesn’t need backing up

Node status Leases: These are ephemeral records of data plane health and configuration state. They’re regenerated automatically when data planes reconnect to the control plane.
IR snapshots: The internal representation is rebuilt from Kubernetes resources on every control plane restart. There’s no need to back it up.
gRPC stream state: Streams are re-established when data planes reconnect. No state persists across restarts.
Metrics and logs: These should be backed up by your observability stack (Thanos, Loki, etc.), not by gateway-specific procedures.

Backup procedures

Option 1: Velero (recommended)

Velero backs up Kubernetes resources and persistent volumes. For Nantian Gateway, you only need resource backups since there are no persistent volumes.

Create a Velero backup that targets the gateway namespace and any namespace containing Gateway API resources:

# Back up the gateway namespace
velero backup create nantian-gw-backup \
  --include-namespaces nantian-gw \
  --wait

# Back up namespaces containing Gateway API and Nantian custom resources
velero backup create nantian-routes-backup \
  --include-namespaces my-app,my-other-app \
  --include-resources gateways.gateway.networking.k8s.io,httproutes.gateway.networking.k8s.io,grpcroutes.gateway.networking.k8s.io,tcproutes.gateway.networking.k8s.io,udproutes.gateway.networking.k8s.io,tlsroutes.gateway.networking.k8s.io,referencegrants.gateway.networking.k8s.io,backendtlspolicies.gateway.networking.k8s.io,aiservices.nantian.dev,tokenpolicies.nantian.dev,wasmplugins.nantian.dev,backendlbpolicies.nantian.dev \
  --wait
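
A backup that finishes is not necessarily a backup that succeeded, so it can help to check the final phase before relying on it. The following is a minimal sketch, assuming Velero runs in the `velero` namespace and records the result in the Backup object's `.status.phase` field; the helper name `backup_completed` is hypothetical.

```shell
# Hypothetical helper: succeed only if the named Velero backup finished cleanly.
# Assumes Velero is installed in the "velero" namespace (pass another as $2).
backup_completed() {
  local name="$1" namespace="${2:-velero}" phase
  # Read the phase straight from the Backup custom resource.
  phase="$(kubectl get backups.velero.io "$name" -n "$namespace" \
    -o jsonpath='{.status.phase}')" || return 1
  echo "backup ${name}: ${phase}"
  [ "$phase" = "Completed" ]
}

# Example:
#   backup_completed nantian-gw-backup || echo "investigate before relying on this backup"
```

Any phase other than `Completed` (for example `PartiallyFailed`) makes the helper return non-zero, which is convenient in CI jobs or cron wrappers.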

Schedule regular backups:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nantian-gw-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - nantian-gw
    ttl: 720h  # 30 days

Option 2: kubectl resource export

For smaller deployments or one-off backups, export resources directly:

# Export all Gateway API resources across all namespaces
kubectl get gateways.gateway.networking.k8s.io --all-namespaces -o yaml > gateways.yaml
kubectl get httproutes.gateway.networking.k8s.io --all-namespaces -o yaml > httproutes.yaml
kubectl get grpcroutes.gateway.networking.k8s.io --all-namespaces -o yaml > grpcroutes.yaml
kubectl get tcproutes.gateway.networking.k8s.io --all-namespaces -o yaml > tcproutes.yaml
kubectl get udproutes.gateway.networking.k8s.io --all-namespaces -o yaml > udproutes.yaml
kubectl get tlsroutes.gateway.networking.k8s.io --all-namespaces -o yaml > tlsroutes.yaml
kubectl get referencegrants.gateway.networking.k8s.io --all-namespaces -o yaml > referencegrants.yaml
kubectl get backendtlspolicies.gateway.networking.k8s.io --all-namespaces -o yaml > backendtlspolicies.yaml

# Export custom CRDs
kubectl get aiservices.nantian.dev --all-namespaces -o yaml > aiservices.yaml
kubectl get tokenpolicies.nantian.dev --all-namespaces -o yaml > tokenpolicies.yaml
kubectl get wasmplugins.nantian.dev --all-namespaces -o yaml > wasmplugins.yaml
kubectl get backendlbpolicies.nantian.dev --all-namespaces -o yaml > backendlbpolicies.yaml

# Export TLS secrets (handle with care - these contain sensitive data)
kubectl get secrets -n nantian-gw -o yaml > nantian-gw-secrets.yaml

# Export Helm values if you customized them
helm get values nantian-gw -n nantian-gw > nantian-gw-helm-values.yaml

Store these files in a secure, version-controlled location. The secrets export contains private keys and credentials, so encrypt it before it leaves the cluster, including before committing it to version control.
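
The export commands above repeat the same pattern, so they can be folded into a loop. This is a sketch rather than a supported tool: the function name `export_gateway_resources` and the output directory layout are assumptions, and resource types that are not installed in the cluster are skipped instead of aborting the run.

```shell
# Hypothetical helper: export each listed resource type to <outdir>/<plural>.yaml.
# Resource types that kubectl cannot list (e.g. CRD not installed) are skipped.
export_gateway_resources() {
  local outdir="$1"; shift
  mkdir -p "$outdir" || return 1
  local resource plural
  for resource in "$@"; do
    # "httproutes.gateway.networking.k8s.io" -> "httproutes"
    plural="${resource%%.*}"
    if kubectl get "$resource" --all-namespaces -o yaml > "${outdir}/${plural}.yaml"; then
      echo "exported ${resource}"
    else
      echo "skipped ${resource} (not installed?)" >&2
      rm -f "${outdir}/${plural}.yaml"
    fi
  done
}

# Example:
#   export_gateway_resources "backup-$(date +%F)" \
#     gateways.gateway.networking.k8s.io \
#     httproutes.gateway.networking.k8s.io \
#     aiservices.nantian.dev
```

Writing each run into a dated directory keeps earlier exports intact, which makes it easier to compare or roll back to a previous state.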

Option 3: GitOps (preferred for production)

If you manage your Gateway API resources through GitOps (ArgoCD, Flux), your resources are already backed up in git. Make sure your git repository includes:

All Gateway API resources (Gateway, HTTPRoute, etc.)
Custom CRD resources (AIService, TokenPolicy, WasmPlugin, BackendLBPolicy)
Helm values files or Kustomize overlays
TLS certificate definitions (store certificates in a secrets manager, not plaintext in git)

With GitOps, recovery is a matter of reapplying the resources from git. The gateway will pick them up automatically.
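
To confirm that git really matches what is running, you can compare a checkout against the live cluster with `kubectl diff`. A minimal sketch, assuming the manifests are plain YAML in a local directory; the helper name `check_drift` is hypothetical. Note that this only reports differences for manifests present in the directory. It does not detect cluster objects that are missing from git.

```shell
# Hypothetical helper: compare manifests in a directory against the live cluster.
# kubectl diff exits 0 when there is no difference, 1 when differences exist,
# and with a higher code when the command itself failed.
check_drift() {
  local manifest_dir="$1" rc=0
  kubectl diff -R -f "$manifest_dir" > /dev/null || rc=$?
  case "$rc" in
    0) echo "in sync" ;;
    1) echo "drift detected" ;;
    *) echo "kubectl diff failed (exit ${rc})" >&2; return 2 ;;
  esac
  return "$rc"
}

# Example:
#   check_drift ./gateway-manifests
```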

Recovery procedures

Full cluster recovery

If you lose the entire cluster, recover in this order:

Restore the Kubernetes cluster (new cluster or recovered cluster)

Install Gateway API CRDs:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml

Install Nantian Gateway CRDs (AIService, TokenPolicy, WasmPlugin, BackendLBPolicy)

Install Nantian Gateway via Helm:

helm install nantian-gw nantian-gw/nantian-gw \
  -f nantian-gw-helm-values.yaml \
  -n nantian-gw --create-namespace

Wait for the control plane to become ready:

kubectl wait --for=condition=ready pod -l app=nantian-controlplane -n nantian-gw --timeout=120s

Restore Gateway API resources:

kubectl apply -f gateways.yaml
kubectl apply -f httproutes.yaml
kubectl apply -f grpcroutes.yaml
# ... apply all resource files

Restore TLS secrets:

kubectl apply -f nantian-gw-secrets.yaml

Restore custom CRD resources:

kubectl apply -f aiservices.yaml
kubectl apply -f tokenpolicies.yaml
kubectl apply -f wasmplugins.yaml
kubectl apply -f backendlbpolicies.yaml

Verify the gateway is operational:

kubectl exec -n nantian-gw deployment/nantian-gw-controlplane -- \
  curl -s http://localhost:18081/v1/summary | jq .
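
In addition to the summary endpoint, the restored Gateway objects themselves can be checked. The sketch below waits for every Gateway to report the standard Gateway API `Programmed` condition. Whether Nantian Gateway sets this condition on its Gateways is an assumption here, and the helper name `wait_for_gateways` is hypothetical.

```shell
# Hypothetical helper: block until all Gateways in the cluster report
# the Gateway API "Programmed" condition, or the timeout expires.
wait_for_gateways() {
  local timeout="${1:-120s}"
  kubectl wait --for=condition=Programmed gateways.gateway.networking.k8s.io \
    --all --all-namespaces --timeout="$timeout"
}

# Example:
#   wait_for_gateways 180s
```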

Control plane recovery

If only the control plane fails (data planes are still running):

# Restart the control plane
kubectl rollout restart deployment/nantian-gw-controlplane -n nantian-gw

# Wait for it to become ready
kubectl wait --for=condition=ready pod -l app=nantian-controlplane -n nantian-gw --timeout=120s

The control plane rebuilds its state from the Kubernetes API. Data planes continue serving traffic with their last configuration during the restart. Once the control plane is ready, data planes reconnect and receive fresh configuration.
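
One caveat with waiting on pod readiness immediately after a restart: the label selector can still match the old pods, which are already Ready, so the wait may return before the new pods exist. `kubectl rollout status` tracks the rollout itself and avoids that race. A minimal sketch, with the hypothetical helper name `restart_and_wait`:

```shell
# Hypothetical helper: restart a Deployment and block until the rollout finishes.
# Defaults to the nantian-gw namespace and a 120s timeout.
restart_and_wait() {
  local deployment="$1" namespace="${2:-nantian-gw}" timeout="${3:-120s}"
  kubectl rollout restart "deployment/${deployment}" -n "$namespace" || return 1
  kubectl rollout status "deployment/${deployment}" -n "$namespace" --timeout="$timeout"
}

# Example:
#   restart_and_wait nantian-gw-controlplane
#   restart_and_wait nantian-gw-dataplane nantian-gw 300s
```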

Data plane recovery

If data planes fail but the control plane is healthy:

# Restart the data plane
kubectl rollout restart deployment/nantian-gw-dataplane -n nantian-gw

# Wait for data planes to become ready
kubectl wait --for=condition=ready pod -l app=nantian-dataplane -n nantian-gw --timeout=120s

Data planes reconnect to the control plane, receive the current snapshot, and resume serving traffic. If you run multiple data plane replicas, a rolling restart replaces pods a few at a time, so the remaining replicas keep serving traffic throughout.

Accidental resource deletion

If a Gateway API resource is accidentally deleted:

Restore from backup: Apply the resource YAML from your backup
Or recreate from GitOps: If using GitOps, the resource will be automatically reconciled

The control plane detects the restored resource within the sync period (default 30 seconds) and rebuilds the snapshot. Data planes receive the updated configuration on the next snapshot publish.
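
If the backup was taken with Velero, a single resource type can be restored into one namespace without touching anything else. By default Velero skips objects that already exist in the cluster, so only the deleted ones are recreated. The sketch below wraps the restore command; the helper name `restore_resource_type` is hypothetical, and the backup name in the example is the one used earlier on this page.

```shell
# Hypothetical helper: restore one resource type into one namespace from a Velero backup.
# Objects that still exist in the cluster are left untouched by Velero's default policy.
restore_resource_type() {
  local backup="$1" resource="$2" namespace="$3"
  velero restore create --from-backup "$backup" \
    --include-resources "$resource" \
    --include-namespaces "$namespace" \
    --wait
}

# Example:
#   restore_resource_type nantian-routes-backup httproutes.gateway.networking.k8s.io my-app
```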

Disaster recovery testing

Test your backup and recovery procedures regularly. A quarterly test is recommended:

Create a test namespace with a copy of your Gateway API resources
Delete the resources in the test namespace
Restore from backup and verify the gateway picks up the restored resources
Verify traffic flow by sending test requests through the restored routes

# Example test flow
kubectl create namespace nantian-dr-test
kubectl get gateways,httproutes -n production -o yaml | \
  sed 's/namespace: production/namespace: nantian-dr-test/' | \
  kubectl apply -n nantian-dr-test -f -

# Verify the gateway sees the test resources
kubectl exec -n nantian-gw deployment/nantian-gw-controlplane -- \
  curl -s http://localhost:18081/v1/listeners | jq '.[] | select(.namespace == "nantian-dr-test")'

# Clean up
kubectl delete namespace nantian-dr-test
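
The traffic verification step can be scripted as well. This is a sketch under a few assumptions: the Gateway publishes a reachable address in `.status.addresses`, the route matches on a hostname, and the listener serves plain HTTP. The helper names `gateway_address` and `check_route`, and the hostname in the example, are hypothetical. Run it before the clean-up step.

```shell
# Hypothetical helper: print the first address reported in a Gateway's status.
gateway_address() {
  kubectl get gateways.gateway.networking.k8s.io "$1" -n "$2" \
    -o jsonpath='{.status.addresses[0].value}'
}

# Hypothetical helper: succeed if a request with the given Host header
# returns the expected HTTP status code (default 200).
check_route() {
  local address="$1" host="$2" path="${3:-/}" expected="${4:-200}" code
  # curl prints 000 as the status code when the connection itself fails.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
    -H "Host: ${host}" "http://${address}${path}")" || true
  echo "${host}${path} -> ${code}"
  [ "$code" = "$expected" ]
}

# Example:
#   addr="$(gateway_address my-gateway nantian-dr-test)"
#   check_route "$addr" app.example.com /healthz
```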

Recovery time expectations

Scenario	Expected recovery time	Notes
Control plane restart	10-30 seconds	Leader election + rebuild from API
Data plane restart	10-30 seconds	Reconnect + receive snapshot
Full cluster recovery	5-15 minutes	Depends on cluster provisioning and resource count
Accidental resource deletion	30-60 seconds	After resource is reapplied

The control plane’s recovery time scales with the number of Gateway API resources. A cluster with thousands of routes will take longer to rebuild the IR snapshot than one with dozens. Monitor the nantian_gateway_snapshot_build_duration_seconds metric to understand your baseline build time.
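
To compare your own environment against the table above, you can time a recovery step during a test run. A minimal sketch; the helper name `time_recovery` is hypothetical, and it measures wall-clock seconds for whatever command it is given.

```shell
# Hypothetical helper: run a command and print how many whole seconds it took.
# The command's own output is sent to stderr so stdout carries only the number.
time_recovery() {
  local start end
  start="$(date +%s)"
  "$@" >&2 || return 1
  end="$(date +%s)"
  echo "$((end - start))"
}

# Example:
#   kubectl rollout restart deployment/nantian-gw-controlplane -n nantian-gw
#   time_recovery kubectl rollout status deployment/nantian-gw-controlplane \
#     -n nantian-gw --timeout=120s
```

Recording this number during each disaster recovery test gives you a trend to watch as the number of routes grows.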

What’s next

Troubleshooting — what to do when recovery doesn’t go as planned
Installation: High Availability — deploy with redundancy to avoid needing recovery
Alerting Rules — get alerted before you need to recover