Backup & Recovery

Nantian Gateway is designed to be largely stateless. The control plane rebuilds its entire internal state from the Kubernetes API on startup. The data plane receives its configuration from the control plane over gRPC/xDS and holds no persistent state of its own. This design simplifies backup and recovery significantly: you mostly need to back up your Kubernetes resources, not the gateway itself.

Nantian Gateway’s operational state lives in Kubernetes resources. Here’s what matters:

| Resource | Back up? | Why |
| --- | --- | --- |
| Gateway, GatewayClass | Yes | Defines listeners, ports, TLS config |
| HTTPRoute, GRPCRoute, TCPRoute, UDPRoute, TLSRoute | Yes | All routing rules |
| AIService, TokenPolicy, WasmPlugin, BackendLBPolicy | Yes | Custom CRD configuration |
| BackendTLSPolicy | Yes | Backend TLS configuration |
| ReferenceGrant | Yes | Cross-namespace access grants |
| TLS Secrets | Yes | Certificates for TLS termination |
| Backend Secrets | Yes | Backend auth credentials |
| ConfigMaps (Helm values, config files) | Yes | Gateway configuration |
| Node status Leases | No | Auto-regenerated by data planes |
| Data plane pods | No | Stateless, redeployed from Deployment spec |
| Control plane pods | No | Stateless, redeployed from Deployment spec |

What you do not need to back up:

  • Node status Leases: These are ephemeral records of data plane health and configuration state. They’re regenerated automatically when data planes reconnect to the control plane.
  • IR snapshots: The internal representation is rebuilt from Kubernetes resources on every control plane restart. There’s no need to back it up.
  • gRPC stream state: Streams are re-established when data planes reconnect. No state persists across restarts.
  • Metrics and logs: These should be backed up by your observability stack (Thanos, Loki, etc.), not by gateway-specific procedures.

Option 1: Velero

Velero backs up Kubernetes resources and persistent volumes. For Nantian Gateway, you only need resource backups since there are no persistent volumes.

Create a Velero backup that targets the gateway namespace and any namespace containing Gateway API resources:

Terminal window
# Backup the gateway namespace
velero backup create nantian-gw-backup \
--include-namespaces nantian-gw \
--wait
# Backup namespaces containing Gateway API resources
velero backup create nantian-routes-backup \
--include-namespaces my-app,my-other-app \
--include-resources gateways.gateway.networking.k8s.io,httproutes.gateway.networking.k8s.io,grpcroutes.gateway.networking.k8s.io,tcproutes.gateway.networking.k8s.io,udproutes.gateway.networking.k8s.io,tlsroutes.gateway.networking.k8s.io,referencegrants.gateway.networking.k8s.io,backendtlspolicies.gateway.networking.k8s.io \
--wait
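
The second backup above lists only Gateway API kinds. If your application namespaces also hold the custom resources (AIService, TokenPolicy, and so on), you may want them in the same backup. Here is a minimal sketch that builds the `--include-resources` value from a list. `ROUTE_RESOURCES`, `join_csv`, and `backup_routes` are hypothetical names, and the `nantian.dev` resource names are taken from the kubectl export commands on this page.

```bash
# Kinds to include, one per line (hypothetical variable; adjust to your cluster).
ROUTE_RESOURCES="
gateways.gateway.networking.k8s.io
httproutes.gateway.networking.k8s.io
grpcroutes.gateway.networking.k8s.io
tcproutes.gateway.networking.k8s.io
udproutes.gateway.networking.k8s.io
tlsroutes.gateway.networking.k8s.io
referencegrants.gateway.networking.k8s.io
backendtlspolicies.gateway.networking.k8s.io
aiservices.nantian.dev
tokenpolicies.nantian.dev
wasmplugins.nantian.dev
backendlbpolicies.nantian.dev
"

# Join the non-empty lines of a list into one comma-separated string.
join_csv() {
  printf '%s\n' "$1" | sed '/^[[:space:]]*$/d' | paste -sd, -
}

# Hypothetical helper: $1 = backup name, $2 = comma-separated namespaces.
backup_routes() {
  velero backup create "$1" \
    --include-namespaces "$2" \
    --include-resources "$(join_csv "$ROUTE_RESOURCES")" \
    --wait
}
```

For example, `backup_routes nantian-routes-backup my-app,my-other-app` creates the backup, and `velero backup describe nantian-routes-backup --details` lets you check which resources it captured.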

Schedule regular backups:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nantian-gw-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - nantian-gw
    ttl: 720h # 30 days
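
If you prefer the Velero CLI over a manifest, the same schedule can probably be created as sketched below. The wrapper name `create_daily_schedule` is hypothetical; the schedule name, cron expression, namespace, and TTL mirror the manifest above.

```bash
# Hypothetical wrapper around the Velero CLI; values mirror the Schedule manifest.
create_daily_schedule() {
  # "0 2 * * *" runs daily at 02:00; 720h keeps each backup for 30 days.
  velero schedule create nantian-gw-daily \
    --schedule="0 2 * * *" \
    --include-namespaces nantian-gw \
    --ttl 720h
}
```

After calling `create_daily_schedule`, you can run `velero backup create --from-schedule nantian-gw-daily` to trigger one backup immediately and confirm that the schedule's template works.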

Option 2: Manual export with kubectl

For smaller deployments or one-off backups, export resources directly:

Terminal window
# Export all Gateway API resources across all namespaces
kubectl get gateways.gateway.networking.k8s.io --all-namespaces -o yaml > gateways.yaml
kubectl get httproutes.gateway.networking.k8s.io --all-namespaces -o yaml > httproutes.yaml
kubectl get grpcroutes.gateway.networking.k8s.io --all-namespaces -o yaml > grpcroutes.yaml
kubectl get tcproutes.gateway.networking.k8s.io --all-namespaces -o yaml > tcproutes.yaml
kubectl get udproutes.gateway.networking.k8s.io --all-namespaces -o yaml > udproutes.yaml
kubectl get tlsroutes.gateway.networking.k8s.io --all-namespaces -o yaml > tlsroutes.yaml
kubectl get referencegrants.gateway.networking.k8s.io --all-namespaces -o yaml > referencegrants.yaml
kubectl get backendtlspolicies.gateway.networking.k8s.io --all-namespaces -o yaml > backendtlspolicies.yaml
# Export custom CRDs
kubectl get aiservices.nantian.dev --all-namespaces -o yaml > aiservices.yaml
kubectl get tokenpolicies.nantian.dev --all-namespaces -o yaml > tokenpolicies.yaml
kubectl get wasmplugins.nantian.dev --all-namespaces -o yaml > wasmplugins.yaml
kubectl get backendlbpolicies.nantian.dev --all-namespaces -o yaml > backendlbpolicies.yaml
# Export all Secrets in the gateway namespace, including TLS certificates (handle with care - these contain sensitive data)
kubectl get secrets -n nantian-gw -o yaml > nantian-gw-secrets.yaml
# Export Helm values if you customized them
helm get values nantian-gw -n nantian-gw > nantian-gw-helm-values.yaml

Store these files in a secure, version-controlled location. Encrypt the secrets file if storing outside the cluster.
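
The per-kind commands above can also be written as a loop, which makes it harder to forget a kind. This is a sketch only: `EXPORT_KINDS`, `export_resources`, and `encrypt_secrets` are hypothetical names, and using `gpg` with a passphrase is just one possible way to encrypt the secrets file, not something Nantian Gateway requires.

```bash
# Kinds exported by the commands above (hypothetical variable).
EXPORT_KINDS="gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io
grpcroutes.gateway.networking.k8s.io tcproutes.gateway.networking.k8s.io
udproutes.gateway.networking.k8s.io tlsroutes.gateway.networking.k8s.io
referencegrants.gateway.networking.k8s.io backendtlspolicies.gateway.networking.k8s.io
aiservices.nantian.dev tokenpolicies.nantian.dev
wasmplugins.nantian.dev backendlbpolicies.nantian.dev"

# Export every kind into <dir>/<plural>.yaml, for example <dir>/httproutes.yaml.
export_resources() {
  dir=$1
  mkdir -p "$dir" || return 1
  for kind in $EXPORT_KINDS; do
    # ${kind%%.*} strips the API group: httproutes.gateway.networking.k8s.io -> httproutes
    kubectl get "$kind" --all-namespaces -o yaml > "$dir/${kind%%.*}.yaml" || return 1
  done
}

# Assumption: gpg is installed. Encrypts with a passphrase (gpg prompts for it),
# then removes the plaintext file only if encryption succeeded.
encrypt_secrets() {
  gpg --symmetric --cipher-algo AES256 --output "$1.gpg" "$1" && rm -f "$1"
}
```

A typical run would be `export_resources ./backup` followed by `encrypt_secrets nantian-gw-secrets.yaml` once the secrets export exists.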

Option 3: GitOps (preferred for production)

If you manage your Gateway API resources through GitOps (ArgoCD, Flux), your resources are already backed up in git. Make sure your git repository includes:

  • All Gateway API resources (Gateway, HTTPRoute, etc.)
  • Custom CRD resources (AIService, TokenPolicy, WasmPlugin, BackendLBPolicy)
  • Helm values files or Kustomize overlays
  • TLS certificate definitions (store certificates in a secrets manager, not plaintext in git)

With GitOps, recovery is a matter of reapplying the resources from git. The gateway will pick them up automatically.

If you lose the entire cluster, recover in this order:

  1. Restore the Kubernetes cluster (new cluster or recovered cluster)
  2. Install Gateway API CRDs:
    Terminal window
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml
  3. Install Nantian Gateway CRDs (AIService, TokenPolicy, WasmPlugin, BackendLBPolicy)
  4. Install Nantian Gateway via Helm:
    Terminal window
    helm install nantian-gw nantian-gw/nantian-gw \
    -f nantian-gw-helm-values.yaml \
    -n nantian-gw --create-namespace
  5. Wait for the control plane to become ready:
    Terminal window
    kubectl wait --for=condition=ready pod -l app=nantian-controlplane -n nantian-gw --timeout=120s
  6. Restore Gateway API resources:
    Terminal window
    kubectl apply -f gateways.yaml
    kubectl apply -f httproutes.yaml
    kubectl apply -f grpcroutes.yaml
    # ... apply all resource files
  7. Restore TLS secrets:
    Terminal window
    kubectl apply -f nantian-gw-secrets.yaml
  8. Restore custom CRD resources:
    Terminal window
    kubectl apply -f aiservices.yaml
    kubectl apply -f tokenpolicies.yaml
    kubectl apply -f wasmplugins.yaml
    kubectl apply -f backendlbpolicies.yaml
  9. Verify the gateway is operational:
    Terminal window
    kubectl exec -n nantian-gw deployment/nantian-gw-controlplane -- \
    curl -s http://localhost:18081/v1/summary | jq .
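
The restore steps above apply the exported files one at a time. If you restore from the manual kubectl export, a loop along these lines keeps the same order (Gateway API resources, then secrets, then custom resources) and skips files you never exported. `RESTORE_ORDER` and `restore_exports` are hypothetical names, and the file names assume the export commands shown earlier on this page.

```bash
# File basenames in the order used by the restore steps (hypothetical variable).
RESTORE_ORDER="gateways httproutes grpcroutes tcproutes udproutes tlsroutes
referencegrants backendtlspolicies nantian-gw-secrets
aiservices tokenpolicies wasmplugins backendlbpolicies"

# Apply every export found in <dir>, stopping at the first failure.
restore_exports() {
  dir=$1
  for name in $RESTORE_ORDER; do
    file="$dir/$name.yaml"
    # Not every cluster uses every kind, so a missing file is not an error.
    if [ ! -f "$file" ]; then
      echo "skipping $name.yaml (not found)"
      continue
    fi
    kubectl apply -f "$file" || return 1
  done
}
```

Run it as `restore_exports ./backup` once the control plane is ready, then continue with the verification step.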

If only the control plane fails (data planes are still running):

Terminal window
# Restart the control plane
kubectl rollout restart deployment/nantian-gw-controlplane -n nantian-gw
# Wait for it to become ready
kubectl wait --for=condition=ready pod -l app=nantian-controlplane -n nantian-gw --timeout=120s

The control plane rebuilds its state from the Kubernetes API. Data planes continue serving traffic with their last configuration during the restart. Once the control plane is ready, data planes reconnect and receive fresh configuration.

If data planes fail but the control plane is healthy:

Terminal window
# Restart the data plane
kubectl rollout restart deployment/nantian-gw-dataplane -n nantian-gw
# Wait for data planes to become ready
kubectl wait --for=condition=ready pod -l app=nantian-dataplane -n nantian-gw --timeout=120s

Data planes reconnect to the control plane, receive the current snapshot, and resume serving traffic. If you run multiple data plane replicas, the rolling restart replaces pods gradually, so the remaining replicas keep serving traffic while each pod restarts.
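
Because an uninterrupted restart depends on having more than one replica, it can be worth checking the replica count first. The sketch below does that and then waits for the rollout to finish. `restart_dataplane` is a hypothetical helper; the deployment name and namespace are the ones used in the commands above.

```bash
# Hypothetical helper: refuse to restart a single-replica data plane.
restart_dataplane() {
  replicas=$(kubectl get deployment nantian-gw-dataplane -n nantian-gw \
    -o jsonpath='{.spec.replicas}') || return 1
  if [ "${replicas:-0}" -lt 2 ]; then
    echo "only ${replicas:-0} replica(s): a restart would interrupt traffic" >&2
    return 1
  fi
  # Restart, then block until the new pods are rolled out (or the timeout expires).
  kubectl rollout restart deployment/nantian-gw-dataplane -n nantian-gw &&
    kubectl rollout status deployment/nantian-gw-dataplane -n nantian-gw --timeout=120s
}
```

If the check fails, scale the deployment up before restarting, or accept a brief interruption and run the restart commands directly.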

If a Gateway API resource is accidentally deleted:

  1. Restore from backup: Apply the resource YAML from your backup
  2. Or recreate from GitOps: If using GitOps, the resource will be automatically reconciled

The control plane detects the restored resource within the sync period (default 30 seconds) and rebuilds the snapshot. Data planes receive the updated configuration on the next snapshot publish.
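
If your backups live in Velero, you can restore just the deleted kind from a single namespace instead of the whole backup. This is a sketch: `restore_kind` is a hypothetical wrapper, and the backup and namespace names in the usage line are the example names used earlier on this page.

```bash
# Hypothetical wrapper: $1 = backup name, $2 = resource kind, $3 = namespace.
restore_kind() {
  velero restore create --from-backup "$1" \
    --include-resources "$2" \
    --include-namespaces "$3" \
    --wait
}
```

For example, `restore_kind nantian-routes-backup httproutes.gateway.networking.k8s.io my-app` brings back the HTTPRoutes for `my-app`. By default Velero should leave resources that still exist in the cluster untouched, so only the missing ones are recreated; check the restore output to confirm.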

Test your backup and recovery procedures regularly. A quarterly test is recommended:

  1. Create a test namespace with a copy of your Gateway API resources
  2. Delete the resources in the test namespace
  3. Restore from backup and verify the gateway picks up the restored resources
  4. Verify traffic flow by sending test requests through the restored routes
Terminal window
# Example test flow
kubectl create namespace nantian-dr-test
kubectl get gateways,httproutes -n production -o yaml | \
sed 's/namespace: production/namespace: nantian-dr-test/' | \
kubectl apply -n nantian-dr-test -f -
# Verify the gateway sees the test resources
kubectl exec -n nantian-gw deployment/nantian-gw-controlplane -- \
curl -s http://localhost:18081/v1/listeners | jq '.[] | select(.namespace == "nantian-dr-test")'
# Clean up
kubectl delete namespace nantian-dr-test
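
The control plane only notices restored resources after its next sync, so a scripted test should poll rather than check once. Below is a minimal polling sketch. `wait_for` and `gateway_sees_namespace` are hypothetical helpers, and the check assumes the `/v1/listeners` endpoint returns a JSON array of objects with a `namespace` field, as the `jq` filter in the example above implies.

```bash
# Retry a command until it succeeds. $1 = time limit in seconds,
# $2 = seconds between attempts, remaining arguments = the command to run.
wait_for() {
  limit=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$limit" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Succeeds when at least one listener belongs to namespace $1 (requires jq).
gateway_sees_namespace() {
  kubectl exec -n nantian-gw deployment/nantian-gw-controlplane -- \
    curl -s http://localhost:18081/v1/listeners |
    jq -e --arg ns "$1" 'any(.[]; .namespace == $ns)' > /dev/null
}
```

With the default 30-second sync period, something like `wait_for 60 5 gateway_sees_namespace nantian-dr-test` leaves reasonable headroom before the test is declared failed.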

| Scenario | Expected recovery time | Notes |
| --- | --- | --- |
| Control plane restart | 10-30 seconds | Leader election + rebuild from API |
| Data plane restart | 10-30 seconds | Reconnect + receive snapshot |
| Full cluster recovery | 5-15 minutes | Depends on cluster provisioning and resource count |
| Accidental resource deletion | 30-60 seconds | After resource is reapplied |

The control plane’s recovery time scales with the number of Gateway API resources. A cluster with thousands of routes will take longer to rebuild the IR snapshot than one with dozens. Monitor the nantian_gateway_snapshot_build_duration_seconds metric to understand your baseline build time.
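
One way to establish that baseline is to query the 95th percentile build time from Prometheus. The sketch below rests on two assumptions that this page does not confirm: that the metric is exported as a Prometheus histogram (so a `_bucket` series exists), and that your Prometheus server is reachable at the URL you pass in. `build_p95_query` and `query_prometheus` are hypothetical helper names.

```bash
# Build a PromQL p95 query. $1 = histogram base name, $2 = rate window (e.g. 5m).
# Assumption: the metric is a histogram, so "<name>_bucket" exists.
build_p95_query() {
  printf 'histogram_quantile(0.95, sum(rate(%s_bucket[%s])) by (le))\n' "$1" "$2"
}

# Send an instant query to the Prometheus HTTP API. $1 = base URL, $2 = PromQL.
query_prometheus() {
  curl -s -G "$1/api/v1/query" --data-urlencode "query=$2"
}
```

For example: `query_prometheus http://prometheus.example:9090 "$(build_p95_query nantian_gateway_snapshot_build_duration_seconds 5m)"`, where the host name is a placeholder. If the metric turns out to be a gauge or summary instead, query it directly rather than through `histogram_quantile`.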