Alerting Rules
This page describes recommended Prometheus alerting rules for Nantian Gateway. These rules cover the most common failure modes and performance degradations you should monitor in production.
The project ships recording rules in deploy/observability/prometheus/ but does not include pre-built alerting rules. The rules below are designed to be added to your Prometheus or Prometheus Operator configuration.
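Every rule on this page uses the standard Prometheus alerting rule layout. As a reading aid, here is a minimal annotated sketch of that layout. The group name, alert name, and `example_metric` are placeholders for illustration, not Nantian Gateway identifiers.

```yaml
groups:
  - name: example-group              # placeholder group name
    rules:
      - alert: ExampleAlert          # placeholder alert name
        expr: example_metric == 0    # PromQL condition; example_metric is hypothetical
        for: 5m                      # condition must hold this long before the alert fires
        labels:
          severity: warning          # label that Alertmanager can route on
        annotations:
          summary: "Short headline for the notification"
          description: "Longer explanation with suggested next steps."
```

Rules shown without a `groups:` header are list items that belong under the `rules:` key of the group introduced earlier in the same section.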
Control plane alerts
Snapshot build failures
Alert when the translator fails to build a snapshot. This means data planes are serving stale configuration.

    groups:
      - name: nantian-controlplane
        rules:
          - alert: NantianSnapshotBuildFailing
            expr: nantian_gateway_snapshot_last_build_success == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Nantian Gateway snapshot build is failing"
              description: "The control plane translator has been unable to build a valid snapshot for 5 minutes. Data planes are serving the last known configuration. Check control plane logs for translation errors."

No published snapshots
Alert when no snapshots have been published recently. This can indicate the translator is stuck or the syncer is not running.

    - alert: NantianNoRecentSnapshot
      expr: rate(nantian_gateway_snapshot_published_total[10m]) == 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway has not published a snapshot recently"
        description: "No snapshots have been published in the last 10 minutes. This may indicate the syncer is stuck or there are no Kubernetes resource changes. Check control plane logs."

High snapshot build duration
Alert when snapshot builds are taking unusually long, which can delay configuration propagation to data planes.

    - alert: NantianSlowSnapshotBuild
      expr: histogram_quantile(0.99, rate(nantian_gateway_snapshot_build_duration_seconds_bucket[5m])) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway snapshot builds are slow"
        description: "P99 snapshot build duration is above 10 seconds. This may indicate a large number of resources or a performance issue in the translator. Check snapshot resource counts and control plane CPU usage."

xDS stream terminations
Alert when data plane streams are terminating at an elevated rate, which can indicate connectivity issues or configuration problems.

    - alert: NantianHighXDSStreamTerminations
      expr: rate(nantian_gateway_controlplane_xds_stream_terminations_total{reason!="shutdown"}[5m]) > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway xDS streams are terminating"
        description: "Data plane xDS streams are terminating at a rate of {{ $value }}/s. Check the reason label for details. Common causes: network issues, data plane restarts, or timeout misconfiguration."

xDS send timeouts
Alert when snapshot sends are timing out, which means data planes are not receiving configuration updates.

    - alert: NantianXDSSendTimeouts
      expr: rate(nantian_gateway_controlplane_xds_snapshot_send_timeouts_total[5m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Nantian Gateway xDS snapshot sends are timing out"
        description: "Data plane xDS streams are being disconnected because snapshot sends are timing out. Check network connectivity between the control plane and data planes, and review xDS keepalive configuration."

xDS ACK timeouts
Alert when data planes are not acknowledging snapshots, which means they may not be applying configuration.

    - alert: NantianXDSAckTimeouts
      expr: rate(nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total[5m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Nantian Gateway xDS ACK timeouts"
        description: "Data plane xDS streams are being disconnected because ACKs are not arriving. This may indicate the data plane is stuck processing a previous snapshot or is overloaded."

High ACK lag
Alert when the time between publishing a snapshot and receiving ACKs is high, indicating slow configuration propagation.

    - alert: NantianHighAckLag
      expr: histogram_quantile(0.99, rate(nantian_gateway_controlplane_xds_publish_ack_lag_seconds_bucket[5m])) > 30
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway xDS ACK lag is high"
        description: "P99 ACK lag is above 30 seconds. Configuration changes are taking a long time to reach data planes. Check data plane resource usage and network latency."

Reconciler failures
Alert when the reconciler runner is failing, which means infrastructure or status reconciliation is broken.

    - alert: NantianReconcilerFailing
      expr: nantian_gateway_controlplane_reconciler_runner_last_run_success == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Nantian Gateway reconciler is failing"
        description: "The reconciler runner has been failing for 5 minutes. Infrastructure or status reconciliation is not completing. Check control plane logs for reconciler errors."

Node status persistence dropping updates
Alert when node status updates are being dropped because the persistence queue is full.

    - alert: NantianNodeStatusDropping
      expr: rate(nantian_gateway_controlplane_node_status_persist_dropped_total[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway is dropping node status updates"
        description: "Node status persistence updates are being dropped because the bounded backlog is full. This may indicate the Kubernetes API server is slow or the persistence worker is stuck."

Data plane alerts
Data plane not ready
Alert when data plane instances are not ready to serve traffic.

    groups:
      - name: nantian-dataplane
        rules:
          - alert: NantianDataplaneNotReady
            expr: nantian_gateway_dataplane_not_ready_replicas > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Nantian Gateway data plane instances are not ready"
              description: "{{ $value }} data plane instance(s) are not ready. Check data plane pod status and logs."

No ready data planes
Alert when no data plane instances are ready. This means no traffic can be served.

    - alert: NantianNoReadyDataplanes
      expr: nantian_gateway_dataplane_ready_replicas == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "No Nantian Gateway data planes are ready"
        description: "Zero data plane instances are ready to serve traffic. All incoming requests will fail. Check data plane deployment status immediately."

High error rate
Alert when the data plane error rate exceeds a threshold.

    - alert: NantianHighErrorRate
      expr: |
        sum(rate(nantian_gateway_dataplane_http_responses_total{status_code=~"5.."}[5m]))
        /
        sum(rate(nantian_gateway_dataplane_http_responses_total[5m]))
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Nantian Gateway data plane error rate is high"
        description: "5xx error rate is above 5% ({{ $value | humanizePercentage }}). Check backend health and data plane logs."

High latency
Alert when P99 request latency exceeds a threshold.

    - alert: NantianHighLatency
      expr: |
        histogram_quantile(0.99,
          sum(rate(nantian_gateway_dataplane_http_request_duration_seconds_bucket[5m])) by (le)
        ) > 2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway data plane latency is high"
        description: "P99 request latency is above 2 seconds ({{ $value }}s). Check backend response times and data plane resource usage."

Backend health degradation
Alert when backend endpoints are unhealthy.

    - alert: NantianBackendUnhealthy
      expr: nantian_gateway_dataplane_backend_health_status == 0
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway backend endpoint is unhealthy"
        description: "Backend {{ $labels.backend }} endpoint {{ $labels.endpoint }} is unhealthy. Check the backend service status."

CPU throttling
Alert when data plane containers are being CPU throttled.

    - alert: NantianDataplaneCPUThrottling
      expr: nantian_gateway_dataplane_container_cpu_throttle_ratio > 0.1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Nantian Gateway data plane CPU throttling"
        description: "Data plane {{ $labels.pod }} CPU throttle ratio is {{ $value | humanizePercentage }}. Consider increasing CPU limits."

Deploying alerting rules
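Before rolling rules out, it can be worth unit-testing them with `promtool test rules`. The sketch below checks that NantianNoReadyDataplanes fires with the expected labels and annotations. It rests on several assumptions: the data plane group from this page has been saved as `nantian-dataplane.yaml` next to the test file (the file name is borrowed from the Native Prometheus example on this page), the test file name is hypothetical, and the synthetic series carries no extra labels. In a real deployment the metric may carry labels such as a namespace, and `exp_labels` would then need to list them.

```yaml
# nantian-dataplane.test.yaml (hypothetical file name)
rule_files:
  - nantian-dataplane.yaml   # assumed to contain the nantian-dataplane group from this page

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic input: zero ready replicas at every sample from 0m to 5m.
      - series: 'nantian_gateway_dataplane_ready_replicas'
        values: '0 0 0 0 0 0'
    alert_rule_test:
      # The rule uses "for: 1m", so the alert should be firing by the 2m mark.
      - eval_time: 2m
        alertname: NantianNoReadyDataplanes
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "No Nantian Gateway data planes are ready"
              description: "Zero data plane instances are ready to serve traffic. All incoming requests will fail. Check data plane deployment status immediately."
```

Running `promtool test rules nantian-dataplane.test.yaml` evaluates the rule file against the synthetic series and reports any mismatch between the expected and actual alerts.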
Prometheus Operator (PrometheusRule)
Create a PrometheusRule resource:

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: nantian-gateway
      namespace: monitoring
      labels:
        release: kube-prometheus-stack   # must match the ruleSelector of your Prometheus resource
    spec:
      groups:
        - name: nantian-controlplane
          rules:
            # ... control plane rules from above
        - name: nantian-dataplane
          rules:
            # ... data plane rules from above

Apply it:

    kubectl apply -f nantian-alerting-rules.yaml

Native Prometheus
Add the rules to your Prometheus configuration:

    rule_files:
      - /etc/prometheus/rules/nantian-controlplane.yaml
      - /etc/prometheus/rules/nantian-dataplane.yaml

Alert severity guidelines
| Severity | When to use |
|---|---|
| critical | Traffic is affected. Wake someone up. |
| warning | Something is degraded but traffic is still flowing. Investigate during business hours. |
Notification routing
Configure Alertmanager to route alerts to the appropriate channels. Example:

    # Each receiver named here must also be defined under the top-level "receivers:" key.
    route:
      receiver: default
      routes:
        - match:
            severity: critical
          receiver: pagerduty
        - match:
            severity: warning
          receiver: slack

What’s next
- Troubleshooting: what to do when an alert fires
- Metrics Reference: complete catalog of all metrics used in alerting rules
- Backup & Recovery: procedures for disaster recovery