
Alerting Rules

This page describes recommended Prometheus alerting rules for Nantian Gateway. These rules cover the most common failure modes and performance degradations you should monitor in production.

The project ships recording rules in deploy/observability/prometheus/ but does not include pre-built alerting rules. The rules below are designed to be added to your Prometheus or Prometheus Operator configuration.

Alert when the translator fails to build a snapshot. This means data planes are serving stale configuration.

groups:
  - name: nantian-controlplane
    rules:
      - alert: NantianSnapshotBuildFailing
        expr: nantian_gateway_snapshot_last_build_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway snapshot build is failing"
          description: "The control plane translator has been unable to build a valid snapshot for 5 minutes. Data planes are serving the last known configuration. Check control plane logs for translation errors."
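
Because the rule waits on a `for: 5m` window, it can help to check the timing with a unit test before deploying. The following is a minimal sketch of a `promtool test rules` file. It assumes the control plane group is saved as `nantian-controlplane.yaml` next to the test file; the test file name and the `instance="cp-0"` label are hypothetical and only serve the example.

```yaml
# nantian-controlplane_test.yaml (hypothetical file name)
# Run with: promtool test rules nantian-controlplane_test.yaml
rule_files:
  - nantian-controlplane.yaml   # assumed location of the control plane group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Build succeeds for the first two samples, then fails from t=2m onward.
      - series: 'nantian_gateway_snapshot_last_build_success{instance="cp-0"}'
        values: '1 1 0 0 0 0 0 0 0 0'
    alert_rule_test:
      # At 4m the alert is still pending (only 2m of failure), so nothing fires.
      - eval_time: 4m
        alertname: NantianSnapshotBuildFailing
        exp_alerts: []
      # At 8m the failure has lasted longer than the 5m "for" window.
      - eval_time: 8m
        alertname: NantianSnapshotBuildFailing
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: cp-0
            exp_annotations:
              summary: "Nantian Gateway snapshot build is failing"
              description: "The control plane translator has been unable to build a valid snapshot for 5 minutes. Data planes are serving the last known configuration. Check control plane logs for translation errors."
```

The expected annotations must match the rule text exactly, so update the test whenever you reword the summary or description.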

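Alert when no snapshots have been published recently. This can indicate that the translator is stuck or the syncer is not running. The alert also fires on a quiet cluster with no Kubernetes resource changes, so tune or silence it if long idle periods are normal for you.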

      - alert: NantianNoRecentSnapshot
        expr: rate(nantian_gateway_snapshot_published_total[10m]) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway has not published a snapshot recently"
          description: "No snapshots have been published in the last 10 minutes. This may indicate the syncer is stuck or there are no Kubernetes resource changes. Check control plane logs."

Alert when snapshot builds are taking unusually long, which can delay configuration propagation to data planes.

      - alert: NantianSlowSnapshotBuild
        expr: histogram_quantile(0.99, rate(nantian_gateway_snapshot_build_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway snapshot builds are slow"
          description: "P99 snapshot build duration is above 10 seconds. This may indicate a large number of resources or a performance issue in the translator. Check snapshot resource counts and control plane CPU usage."

Alert when data plane streams are terminating at an elevated rate, which can indicate connectivity issues or configuration problems.

      - alert: NantianHighXDSStreamTerminations
        expr: rate(nantian_gateway_controlplane_xds_stream_terminations_total{reason!="shutdown"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway xDS streams are terminating"
          description: "Data plane xDS streams are terminating at a rate of {{ $value }}/s. Check the reason label for details. Common causes: network issues, data plane restarts, or timeout misconfiguration."

Alert when snapshot sends are timing out, which means data planes are not receiving configuration updates.

      - alert: NantianXDSSendTimeouts
        expr: rate(nantian_gateway_controlplane_xds_snapshot_send_timeouts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway xDS snapshot sends are timing out"
          description: "Data plane xDS streams are being disconnected because snapshot sends are timing out. Check network connectivity between the control plane and data planes, and review xDS keepalive configuration."

Alert when data planes are not acknowledging snapshots, which means they may not be applying configuration.

      - alert: NantianXDSAckTimeouts
        expr: rate(nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway xDS ACK timeouts"
          description: "Data plane xDS streams are being disconnected because ACKs are not arriving. This may indicate the data plane is stuck processing a previous snapshot or is overloaded."

Alert when the time between publishing a snapshot and receiving ACKs is high, indicating slow configuration propagation.

      - alert: NantianHighAckLag
        expr: histogram_quantile(0.99, rate(nantian_gateway_controlplane_xds_publish_ack_lag_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway xDS ACK lag is high"
          description: "P99 ACK lag is above 30 seconds. Configuration changes are taking a long time to reach data planes. Check data plane resource usage and network latency."

Alert when the reconciler runner is failing, which means infrastructure or status reconciliation is broken.

      - alert: NantianReconcilerFailing
        expr: nantian_gateway_controlplane_reconciler_runner_last_run_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway reconciler is failing"
          description: "The reconciler runner has been failing for 5 minutes. Infrastructure or status reconciliation is not completing. Check control plane logs for reconciler errors."

Alert when node status updates are being dropped because the persistence queue is full.

      - alert: NantianNodeStatusDropping
        expr: rate(nantian_gateway_controlplane_node_status_persist_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway is dropping node status updates"
          description: "Node status persistence updates are being dropped because the bounded backlog is full. This may indicate the Kubernetes API server is slow or the persistence worker is stuck."

Alert when data plane instances are not ready to serve traffic.

groups:
  - name: nantian-dataplane
    rules:
      - alert: NantianDataplaneNotReady
        expr: nantian_gateway_dataplane_not_ready_replicas > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway data plane instances are not ready"
          description: "{{ $value }} data plane instance(s) are not ready. Check data plane pod status and logs."

Alert when no data plane instances are ready. This means no traffic can be served.

      - alert: NantianNoReadyDataplanes
        expr: nantian_gateway_dataplane_ready_replicas == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No Nantian Gateway data planes are ready"
          description: "Zero data plane instances are ready to serve traffic. All incoming requests will fail. Check data plane deployment status immediately."
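
A comparison such as `== 0` returns nothing when the series itself is missing, for example if the component that exports the metric is down or not being scraped. In that case the rule above stays silent. A possible companion rule, sketched here with a hypothetical alert name and an assumed severity and `for` duration, uses the PromQL `absent()` function to cover that gap:

```yaml
      # Hypothetical companion rule; the name, severity, and duration are assumptions.
      - alert: NantianReadyReplicasMetricMissing
        # absent() returns 1 only when no series with this metric name exists.
        expr: absent(nantian_gateway_dataplane_ready_replicas)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway readiness metric is missing"
          description: "The nantian_gateway_dataplane_ready_replicas metric has not been reported for 5 minutes, so data plane readiness alerts cannot fire. Check that the metrics endpoint is up and being scraped."
```

Whether this should page depends on how you treat scrape outages elsewhere in your setup.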

Alert when more than 5% of data plane responses are 5xx errors.

      - alert: NantianHighErrorRate
        expr: |
          sum(rate(nantian_gateway_dataplane_http_responses_total{status_code=~"5.."}[5m]))
          /
          sum(rate(nantian_gateway_dataplane_http_responses_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway data plane error rate is high"
          description: "5xx error rate is above 5% ({{ $value | humanizePercentage }}). Check backend health and data plane logs."
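
At very low request volume, a handful of failed requests can push a ratio above 5% and trigger a page. One common way to dampen this is to require a minimum amount of traffic before the ratio counts. The variant below is a sketch, not a project recommendation: the floor of 1 request per second is an assumed value that you would need to tune, and the rule is meant to replace the one above, not to run next to it.

```yaml
      # Variant with a traffic floor; the "> 1" requests/s threshold is an assumption.
      - alert: NantianHighErrorRate
        expr: |
          (
            sum(rate(nantian_gateway_dataplane_http_responses_total{status_code=~"5.."}[5m]))
            /
            sum(rate(nantian_gateway_dataplane_http_responses_total[5m]))
          ) > 0.05
          and
          sum(rate(nantian_gateway_dataplane_http_responses_total[5m])) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway data plane error rate is high"
          description: "5xx error rate is above 5% ({{ $value | humanizePercentage }}). Check backend health and data plane logs."
```

Both sides of `and` are aggregated with `sum` and carry no labels, so they match each other, and `{{ $value }}` still reports the error ratio from the left-hand side.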

Alert when P99 request latency exceeds 2 seconds.

      - alert: NantianHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(nantian_gateway_dataplane_http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway data plane latency is high"
          description: "P99 request latency is above 2 seconds ({{ $value }}s). Check backend response times and data plane resource usage."

Alert when backend endpoints are unhealthy.

      - alert: NantianBackendUnhealthy
        expr: nantian_gateway_dataplane_backend_health_status == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway backend endpoint is unhealthy"
          description: "Backend {{ $labels.backend }} endpoint {{ $labels.endpoint }} is unhealthy. Check the backend service status."

Alert when a data plane container's CPU throttle ratio stays above 10%.

      - alert: NantianDataplaneCPUThrottling
        expr: nantian_gateway_dataplane_container_cpu_throttle_ratio > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway data plane CPU throttling"
          description: "Data plane {{ $labels.pod }} CPU throttle ratio is {{ $value | humanizePercentage }}. Consider increasing CPU limits."

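If you use the Prometheus Operator, create a PrometheusRule resource. Adjust the release label so that it matches the ruleSelector of your Prometheus instance: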

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nantian-gateway
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: nantian-controlplane
      rules:
        # ... control plane rules from above
    - name: nantian-dataplane
      rules:
        # ... data plane rules from above

Save the manifest as nantian-alerting-rules.yaml and apply it:

kubectl apply -f nantian-alerting-rules.yaml

If you run Prometheus without the Operator, save each rule group to its own file and reference the files in your Prometheus configuration:

rule_files:
  - /etc/prometheus/rules/nantian-controlplane.yaml
  - /etc/prometheus/rules/nantian-dataplane.yaml

Severity    When to use
critical    Traffic is affected. Wake someone up.
warning     Something is degraded but traffic is still flowing. Investigate during business hours.

Configure Alertmanager to route alerts to the appropriate channels. Example:

route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty
    - match:
        severity: warning
      receiver: slack
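
Alertmanager rejects a route that names a receiver it cannot find, so the three receivers referenced above need definitions in the same file. The fragment below is a sketch of what could sit alongside the `route:` block. All keys, URLs, and the channel name are placeholders, and the inhibit rule is an optional assumption on our part: it mutes the "not ready" alert while the "no ready data planes" alert is firing, since both describe the same outage. `source_matchers` and `target_matchers` require Alertmanager 0.22 or later.

```yaml
# Sketch only: every value in angle brackets and the channel name are placeholders.
receivers:
  - name: default            # catch-all; no notifier configured in this sketch
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"
  - name: slack
    slack_configs:
      - api_url: "<your-slack-webhook-url>"
        channel: "#gateway-alerts"

inhibit_rules:
  # While every data plane is down, suppress the less specific "not ready" alert.
  - source_matchers:
      - alertname="NantianNoReadyDataplanes"
    target_matchers:
      - alertname="NantianDataplaneNotReady"
```

Running `amtool check-config` against the finished file is a quick way to catch an undefined receiver before reloading Alertmanager.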