Backup & Recovery
Nantian Gateway is designed to be largely stateless. The control plane rebuilds its entire internal state from the Kubernetes API on startup. The data plane receives its configuration from the control plane over gRPC/xDS and holds no persistent state of its own. This design simplifies backup and recovery significantly: you mostly need to back up your Kubernetes resources, not the gateway itself.
What needs backing up
Nantian Gateway’s operational state lives in Kubernetes resources. Here’s what matters:
| Resource | Back up? | Why |
|---|---|---|
| Gateway, GatewayClass | Yes | Defines listeners, ports, TLS config |
| HTTPRoute, GRPCRoute, TCPRoute, UDPRoute, TLSRoute | Yes | All routing rules |
| AIService, TokenPolicy, WasmPlugin, BackendLBPolicy | Yes | Custom CRD configuration |
| BackendTLSPolicy | Yes | Backend TLS configuration |
| ReferenceGrant | Yes | Cross-namespace access grants |
| TLS Secrets | Yes | Certificates for TLS termination |
| Backend Secrets | Yes | Backend auth credentials |
| ConfigMaps (Helm values, config files) | Yes | Gateway configuration |
| Node status Leases | No | Auto-regenerated by data planes |
| Data plane pods | No | Stateless, redeployed from Deployment spec |
| Control plane pods | No | Stateless, redeployed from Deployment spec |
What doesn’t need backing up
- Node status Leases: These are ephemeral records of data plane health and configuration state. They’re regenerated automatically when data planes reconnect to the control plane.
- IR snapshots: The internal representation is rebuilt from Kubernetes resources on every control plane restart. There’s no need to back it up.
- gRPC stream state: Streams are re-established when data planes reconnect. No state persists across restarts.
- Metrics and logs: These should be backed up by your observability stack (Thanos, Loki, etc.), not by gateway-specific procedures.
Backup procedures
Option 1: Velero (recommended)
Velero backs up Kubernetes resources and persistent volumes. For Nantian Gateway, you only need resource backups since there are no persistent volumes.
Create a Velero backup that targets the gateway namespace and any namespace containing Gateway API resources:
```bash
# Backup the gateway namespace
velero backup create nantian-gw-backup \
  --include-namespaces nantian-gw \
  --wait

# Backup namespaces containing Gateway API resources
velero backup create nantian-routes-backup \
  --include-namespaces my-app,my-other-app \
  --include-resources gateways.gateway.networking.k8s.io,httproutes.gateway.networking.k8s.io,grpcroutes.gateway.networking.k8s.io,tcproutes.gateway.networking.k8s.io,udproutes.gateway.networking.k8s.io,tlsroutes.gateway.networking.k8s.io,referencegrants.gateway.networking.k8s.io,backendtlspolicies.gateway.networking.k8s.io \
  --wait
```

Schedule regular backups:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nantian-gw-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - nantian-gw
    ttl: 720h  # 30 days
```

Option 2: kubectl resource export
For smaller deployments or one-off backups, export resources directly:
```bash
# Export all Gateway API resources across all namespaces
kubectl get gateways.gateway.networking.k8s.io --all-namespaces -o yaml > gateways.yaml
kubectl get httproutes.gateway.networking.k8s.io --all-namespaces -o yaml > httproutes.yaml
kubectl get grpcroutes.gateway.networking.k8s.io --all-namespaces -o yaml > grpcroutes.yaml
kubectl get tcproutes.gateway.networking.k8s.io --all-namespaces -o yaml > tcproutes.yaml
kubectl get udproutes.gateway.networking.k8s.io --all-namespaces -o yaml > udproutes.yaml
kubectl get tlsroutes.gateway.networking.k8s.io --all-namespaces -o yaml > tlsroutes.yaml
kubectl get referencegrants.gateway.networking.k8s.io --all-namespaces -o yaml > referencegrants.yaml
kubectl get backendtlspolicies.gateway.networking.k8s.io --all-namespaces -o yaml > backendtlspolicies.yaml

# Export custom CRDs
kubectl get aiservices.nantian.dev --all-namespaces -o yaml > aiservices.yaml
kubectl get tokenpolicies.nantian.dev --all-namespaces -o yaml > tokenpolicies.yaml
kubectl get wasmplugins.nantian.dev --all-namespaces -o yaml > wasmplugins.yaml
kubectl get backendlbpolicies.nantian.dev --all-namespaces -o yaml > backendlbpolicies.yaml

# Export TLS secrets (handle with care - these contain sensitive data)
kubectl get secrets -n nantian-gw -o yaml > nantian-gw-secrets.yaml

# Export Helm values if you customized them
helm get values nantian-gw -n nantian-gw > nantian-gw-helm-values.yaml
```

Store these files in a secure, version-controlled location. Encrypt the secrets file if storing outside the cluster.
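The per-resource export commands all follow one pattern, so they can be collapsed into a loop. The following is a minimal sketch rather than part of the Nantian Gateway tooling: the function name `export_gateway_resources` and its output-directory argument are hypothetical, and the resource list is simply copied from the commands in this section.

```bash
# Hypothetical helper: export each resource type to its own YAML file.
# The resource names are copied from the export commands in this section;
# trim the list to the CRDs actually installed in your cluster.
export_gateway_resources() {
  egr_out_dir="${1:-.}"
  mkdir -p "$egr_out_dir" || return 1
  for egr_resource in \
    gateways.gateway.networking.k8s.io \
    httproutes.gateway.networking.k8s.io \
    grpcroutes.gateway.networking.k8s.io \
    tcproutes.gateway.networking.k8s.io \
    udproutes.gateway.networking.k8s.io \
    tlsroutes.gateway.networking.k8s.io \
    referencegrants.gateway.networking.k8s.io \
    backendtlspolicies.gateway.networking.k8s.io \
    aiservices.nantian.dev \
    tokenpolicies.nantian.dev \
    wasmplugins.nantian.dev \
    backendlbpolicies.nantian.dev
  do
    # File name is the part before the first dot, e.g. "httproutes.yaml".
    egr_file="$egr_out_dir/${egr_resource%%.*}.yaml"
    if ! kubectl get "$egr_resource" --all-namespaces -o yaml > "$egr_file"; then
      echo "export failed for $egr_resource" >&2
      return 1
    fi
  done
}
```

Calling `export_gateway_resources ./backup-$(date +%F)` would write one file per resource type into a dated directory. Secrets and Helm values are deliberately left out of the loop so that sensitive files can be handled separately.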
Option 3: GitOps (preferred for production)
If you manage your Gateway API resources through GitOps (ArgoCD, Flux), your resources are already backed up in git. Make sure your git repository includes:
- All Gateway API resources (Gateway, HTTPRoute, etc.)
- Custom CRD resources (AIService, TokenPolicy, WasmPlugin, BackendLBPolicy)
- Helm values files or Kustomize overlays
- TLS certificate definitions (store certificates in a secrets manager, not plaintext in git)
With GitOps, recovery is a matter of reapplying the resources from git. The gateway will pick them up automatically.
Recovery procedures
Full cluster recovery
If you lose the entire cluster, recover in this order:
1. **Restore the Kubernetes cluster** (new cluster or recovered cluster)
2. **Install Gateway API CRDs**:
   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.1/standard-install.yaml
   ```
3. **Install Nantian Gateway CRDs** (AIService, TokenPolicy, WasmPlugin, BackendLBPolicy)
4. **Install Nantian Gateway via Helm**:
   ```bash
   helm install nantian-gw nantian-gw/nantian-gw \
     -f nantian-gw-helm-values.yaml \
     -n nantian-gw --create-namespace
   ```
5. **Wait for the control plane to become ready**:
   ```bash
   kubectl wait --for=condition=ready pod -l app=nantian-controlplane -n nantian-gw --timeout=120s
   ```
6. **Restore Gateway API resources**:
   ```bash
   kubectl apply -f gateways.yaml
   kubectl apply -f httproutes.yaml
   kubectl apply -f grpcroutes.yaml
   # ... apply all resource files
   ```
7. **Restore TLS secrets**:
   ```bash
   kubectl apply -f nantian-gw-secrets.yaml
   ```
8. **Restore custom CRD resources**:
   ```bash
   kubectl apply -f aiservices.yaml
   kubectl apply -f tokenpolicies.yaml
   kubectl apply -f wasmplugins.yaml
   kubectl apply -f backendlbpolicies.yaml
   ```
9. **Verify the gateway is operational**:
   ```bash
   kubectl exec -n nantian-gw deployment/nantian-gw-controlplane -- \
     curl -s http://localhost:18081/v1/summary | jq .
   ```
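Applying resources immediately after installing CRDs can fail if the API server has not finished registering the new types. One way to guard against that, sketched here under assumptions, is to wait for each CRD to report the `Established` condition before restoring resources. The function name `wait_for_crds` is hypothetical; `kubectl wait --for=condition=Established` is standard kubectl, and the CRD names in the usage note are taken from the export commands on this page.

```bash
# Hypothetical helper: block until every named CRD is Established.
# Usage: wait_for_crds <timeout> <crd-name>...
wait_for_crds() {
  wfc_timeout="$1"
  shift
  for wfc_crd in "$@"; do
    if ! kubectl wait --for=condition=Established "crd/$wfc_crd" --timeout="$wfc_timeout"; then
      echo "CRD $wfc_crd was not established within $wfc_timeout" >&2
      return 1
    fi
  done
}
```

For example, `wait_for_crds 60s gateways.gateway.networking.k8s.io httproutes.gateway.networking.k8s.io aiservices.nantian.dev` could run after the CRD installation steps and before any `kubectl apply` of backed-up resources.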
Control plane recovery
If only the control plane fails (data planes are still running):

```bash
# Restart the control plane
kubectl rollout restart deployment/nantian-gw-controlplane -n nantian-gw

# Wait for it to become ready
kubectl wait --for=condition=ready pod -l app=nantian-controlplane -n nantian-gw --timeout=120s
```

The control plane rebuilds its state from the Kubernetes API. Data planes continue serving traffic with their last configuration during the restart. Once the control plane is ready, data planes reconnect and receive fresh configuration.
Data plane recovery
If data planes fail but the control plane is healthy:

```bash
# Restart the data plane
kubectl rollout restart deployment/nantian-gw-dataplane -n nantian-gw

# Wait for data planes to become ready
kubectl wait --for=condition=ready pod -l app=nantian-dataplane -n nantian-gw --timeout=120s
```

Data planes reconnect to the control plane, receive the current snapshot, and resume serving traffic. If you have multiple data plane replicas, rolling restart ensures no traffic interruption.
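One caveat with waiting on pod readiness: a `kubectl wait --for=condition=ready pod` issued right after `rollout restart` can match pods from the previous ReplicaSet that are still Ready, and return before the replacement pods exist. Tracking the rollout itself avoids that race. The sketch below uses the standard `kubectl rollout status` command; the function name `restart_and_wait` is hypothetical, and the defaults mirror the namespace and timeout used in the commands above.

```bash
# Hypothetical helper: restart a Deployment and wait for the rollout itself.
# Usage: restart_and_wait <deployment> [namespace] [timeout]
restart_and_wait() {
  rw_deploy="$1"
  rw_ns="${2:-nantian-gw}"
  rw_timeout="${3:-120s}"
  kubectl rollout restart "deployment/$rw_deploy" -n "$rw_ns" || return 1
  # Blocks until the new ReplicaSet is fully rolled out or the timeout expires.
  kubectl rollout status "deployment/$rw_deploy" -n "$rw_ns" --timeout="$rw_timeout"
}
```

The same helper would cover both recovery cases, for example `restart_and_wait nantian-gw-controlplane` or `restart_and_wait nantian-gw-dataplane nantian-gw 300s`.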
Accidental resource deletion
If a Gateway API resource is accidentally deleted:
- Restore from backup: Apply the resource YAML from your backup
- Or recreate from GitOps: If using GitOps, the resource will be automatically reconciled
The control plane detects the restored resource within the sync period (default 30 seconds) and rebuilds the snapshot. Data planes receive the updated configuration on the next snapshot publish.
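If the backup was taken with Velero, a single resource type in a single namespace can be restored without replaying the whole backup. The following is a hedged sketch: the function name `restore_from_velero` is hypothetical, the backup and namespace names in the usage note are the illustrative ones from the Velero section, and the flags are those of the standard `velero restore create` command.

```bash
# Hypothetical helper: restore one resource type in one namespace from a
# Velero backup.
# Usage: restore_from_velero <backup-name> <namespace> <resource-type>
restore_from_velero() {
  if [ "$#" -ne 3 ]; then
    echo "usage: restore_from_velero <backup-name> <namespace> <resource-type>" >&2
    return 2
  fi
  velero restore create \
    --from-backup "$1" \
    --include-namespaces "$2" \
    --include-resources "$3" \
    --wait
}
```

For example, `restore_from_velero nantian-routes-backup my-app httproutes.gateway.networking.k8s.io` would recreate deleted HTTPRoutes in `my-app`. By default Velero skips resources that already exist in the cluster, so a scoped restore like this should only recreate what is missing.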
Disaster recovery testing
Test your backup and recovery procedures regularly. A quarterly test is recommended:
- Create a test namespace with a copy of your Gateway API resources
- Delete the resources in the test namespace
- Restore from backup and verify the gateway picks up the restored resources
- Verify traffic flow by sending test requests through the restored routes
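The final step in the list, verifying traffic flow, could be scripted with a small status-code check. This is only a sketch: the function name `check_route` and the URL in the usage note are hypothetical placeholders, and the check assumes the restored route is reachable over HTTP from wherever the script runs.

```bash
# Hypothetical helper: send one request and compare the HTTP status code.
# Usage: check_route <url> [expected-status]
check_route() {
  cr_url="$1"
  cr_expected="${2:-200}"
  # curl prints 000 as the status code when the connection itself fails.
  cr_actual=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$cr_url")
  if [ "$cr_actual" = "$cr_expected" ]; then
    echo "OK   $cr_url -> $cr_actual"
  else
    echo "FAIL $cr_url -> $cr_actual (expected $cr_expected)" >&2
    return 1
  fi
}
```

A call such as `check_route http://gateway.example.test/healthz 200` returns non-zero on a mismatch, so it can gate the rest of a DR test script.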
```bash
# Example test flow
kubectl create namespace nantian-dr-test
kubectl get gateways,httproutes -n production -o yaml | \
  sed 's/namespace: production/namespace: nantian-dr-test/' | \
  kubectl apply -n nantian-dr-test -f -

# Verify the gateway sees the test resources
kubectl exec -n nantian-gw deployment/nantian-gw-controlplane -- \
  curl -s http://localhost:18081/v1/listeners | jq '.[] | select(.namespace == "nantian-dr-test")'

# Clean up
kubectl delete namespace nantian-dr-test
```

Recovery time expectations
| Scenario | Expected recovery time | Notes |
|---|---|---|
| Control plane restart | 10-30 seconds | Leader election + rebuild from API |
| Data plane restart | 10-30 seconds | Reconnect + receive snapshot |
| Full cluster recovery | 5-15 minutes | Depends on cluster provisioning and resource count |
| Accidental resource deletion | 30-60 seconds | After resource is reapplied |
The control plane’s recovery time scales with the number of Gateway API resources. A cluster with thousands of routes will take longer to rebuild the IR snapshot than one with dozens. Monitor the nantian_gateway_snapshot_build_duration_seconds metric to understand your baseline build time.
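To read that metric outside a dashboard, it can be queried through the Prometheus HTTP API. The sketch below is an assumption-heavy illustration: the function name `query_prometheus` and the Prometheus address in the usage note are hypothetical, and the example expression assumes the metric is exported as a histogram or summary with `_sum` and `_count` series, which this page does not confirm.

```bash
# Hypothetical helper: run an instant PromQL query against the Prometheus
# HTTP API (/api/v1/query) and print the raw JSON response.
# Usage: query_prometheus <prometheus-base-url> <promql-expression>
query_prometheus() {
  curl -s --get "$1/api/v1/query" --data-urlencode "query=$2"
}
```

Under those assumptions, an average build time could be fetched with `query_prometheus http://prometheus.example.test:9090 'nantian_gateway_snapshot_build_duration_seconds_sum / nantian_gateway_snapshot_build_duration_seconds_count'` and piped to `jq` for inspection.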
What’s next
- Troubleshooting: what to do when recovery doesn’t go as planned
- Installation: High Availability: deploy with redundancy to avoid needing recovery
- Alerting Rules: get alerted before you need to recover