Alerting Rules

This page describes recommended Prometheus alerting rules for Nantian Gateway. These rules cover the most common failure modes and performance degradations you should monitor in production.

The project ships recording rules in deploy/observability/prometheus/ but does not include pre-built alerting rules. The rules below are designed to be added to your Prometheus or Prometheus Operator configuration.

Control plane alerts

Snapshot build failures

Alert when the translator fails to build a snapshot. While builds are failing, data planes keep serving the last successfully built configuration, so new changes do not take effect.

groups:
  - name: nantian-controlplane
    rules:
      - alert: NantianSnapshotBuildFailing
        expr: nantian_gateway_snapshot_last_build_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway snapshot build is failing"
          description: "The control plane translator has been unable to build a valid snapshot for 5 minutes. Data planes are serving the last known configuration. Check control plane logs for translation errors."
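
An expression of the form `metric == 0` returns no result when the metric itself is missing, so the rule above cannot fire if the control plane is down or is not being scraped. A minimal sketch of a companion rule, using the standard PromQL `absent()` function; the group name, alert name, and 5m window are assumptions and are not shipped with the project:

```yaml
groups:
  - name: nantian-controlplane-absent        # hypothetical group name
    rules:
      - alert: NantianSnapshotMetricAbsent   # hypothetical alert name
        # absent() returns 1 only when no series with this metric name exists
        expr: absent(nantian_gateway_snapshot_last_build_success)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway snapshot build metric is missing"
          description: "No nantian_gateway_snapshot_last_build_success series has been seen for 5 minutes. The control plane may be down or not scraped."
```

The same pattern can be applied to any other gauge on this page whose absence would otherwise go unnoticed.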

No published snapshots

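Alert when no snapshots have been published recently. This can indicate the translator is stuck or the syncer is not running. The rule also fires in a cluster where no Kubernetes resources have changed, which is why it is a warning rather than a critical alert.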

      - alert: NantianNoRecentSnapshot
        expr: rate(nantian_gateway_snapshot_published_total[10m]) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway has not published a snapshot recently"
          description: "No snapshots have been published for at least 10 minutes. This may indicate the syncer is stuck, or simply that no Kubernetes resources have changed. Check control plane logs."

High snapshot build duration

Alert when the P99 snapshot build duration stays above 10 seconds, which delays configuration propagation to data planes.

      - alert: NantianSlowSnapshotBuild
        expr: histogram_quantile(0.99, rate(nantian_gateway_snapshot_build_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway snapshot builds are slow"
          description: "P99 snapshot build duration is above 10 seconds. This may indicate a large number of resources or a performance issue in the translator. Check snapshot resource counts and control plane CPU usage."
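
The expression above computes the quantile per series, so each scraped control plane instance is evaluated separately. If you run several instances and want a single fleet-wide P99, the bucket rates can be summed by `le` first, as the data plane latency rule on this page does. A sketch under that assumption; the group and alert names are hypothetical:

```yaml
groups:
  - name: nantian-controlplane-aggregate           # hypothetical group name
    rules:
      - alert: NantianSlowSnapshotBuildAggregate   # hypothetical alert name
        # Sum bucket rates across all series before computing the quantile
        expr: histogram_quantile(0.99, sum by (le) (rate(nantian_gateway_snapshot_build_duration_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway snapshot builds are slow across control plane instances"
          description: "Aggregate P99 snapshot build duration is {{ $value | humanizeDuration }}, above the 10 second threshold."
```

The aggregated form produces one alert instead of one per instance, at the cost of hiding which instance is slow.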

xDS stream terminations

Alert when data plane streams terminate at more than 0.1 per second for reasons other than shutdown, which can indicate connectivity issues or configuration problems.

      - alert: NantianHighXDSStreamTerminations
        expr: rate(nantian_gateway_controlplane_xds_stream_terminations_total{reason!="shutdown"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway xDS streams are terminating"
          description: "Data plane xDS streams are terminating at a rate of {{ $value }}/s. Check the reason label for details. Common causes: network issues, data plane restarts, or timeout misconfiguration."
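
The description above asks the responder to look up the `reason` label by hand. A possible variant aggregates by `reason` so that each termination reason raises its own alert and the reason appears in the notification. This is a sketch only; the group and alert names are hypothetical, and it assumes `reason` is the label you want to split on:

```yaml
groups:
  - name: nantian-controlplane-xds-by-reason          # hypothetical group name
    rules:
      - alert: NantianXDSStreamTerminationsByReason   # hypothetical alert name
        # One alert instance per reason value; shutdown is excluded as in the rule above
        expr: sum by (reason) (rate(nantian_gateway_controlplane_xds_stream_terminations_total{reason!="shutdown"}[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway xDS streams are terminating ({{ $labels.reason }})"
          description: "xDS streams are terminating with reason {{ $labels.reason }} at {{ $value | humanize }}/s."
```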

xDS send timeouts

Alert when snapshot sends are timing out, which means data planes are not receiving configuration updates.

      - alert: NantianXDSSendTimeouts
        expr: rate(nantian_gateway_controlplane_xds_snapshot_send_timeouts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway xDS snapshot sends are timing out"
          description: "Data plane xDS streams are being disconnected because snapshot sends are timing out. Check network connectivity between the control plane and data planes, and review xDS keepalive configuration."

xDS ACK timeouts

Alert when data planes are not acknowledging snapshots, which means they may not be applying configuration.

      - alert: NantianXDSAckTimeouts
        expr: rate(nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway xDS ACK timeouts"
          description: "Data plane xDS streams are being disconnected because ACKs are not arriving. This may indicate the data plane is stuck processing a previous snapshot or is overloaded."

High ACK lag

Alert when the time between publishing a snapshot and receiving ACKs is high, indicating slow configuration propagation.

      - alert: NantianHighAckLag
        expr: histogram_quantile(0.99, rate(nantian_gateway_controlplane_xds_publish_ack_lag_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway xDS ACK lag is high"
          description: "P99 ACK lag is above 30 seconds. Configuration changes are taking a long time to reach data planes. Check data plane resource usage and network latency."

Reconciler failures

Alert when the reconciler runner is failing, which means infrastructure or status reconciliation is broken.

      - alert: NantianReconcilerFailing
        expr: nantian_gateway_controlplane_reconciler_runner_last_run_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway reconciler is failing"
          description: "The reconciler runner has been failing for 5 minutes. Infrastructure or status reconciliation is not completing. Check control plane logs for reconciler errors."

Node status persistence dropping updates

Alert when node status updates are being dropped because the persistence queue is full.

      - alert: NantianNodeStatusDropping
        expr: rate(nantian_gateway_controlplane_node_status_persist_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway is dropping node status updates"
          description: "Node status persistence updates are being dropped because the bounded backlog is full. This may indicate the Kubernetes API server is slow or the persistence worker is stuck."

Data plane alerts

Data plane not ready

Alert when data plane instances are not ready to serve traffic.

groups:
  - name: nantian-dataplane
    rules:
      - alert: NantianDataplaneNotReady
        expr: nantian_gateway_dataplane_not_ready_replicas > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway data plane instances are not ready"
          description: "{{ $value }} data plane instance(s) are not ready. Check data plane pod status and logs."

No ready data planes

Alert when no data plane instances are ready. This means no traffic can be served.

      - alert: NantianNoReadyDataplanes
        expr: nantian_gateway_dataplane_ready_replicas == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No Nantian Gateway data planes are ready"
          description: "Zero data plane instances are ready to serve traffic. All incoming requests will fail. Check data plane deployment status immediately."

High error rate

Alert when more than 5% of data plane responses are 5xx errors, measured over a 5 minute window.

      - alert: NantianHighErrorRate
        expr: |
          sum(rate(nantian_gateway_dataplane_http_responses_total{status_code=~"5.."}[5m]))
          /
          sum(rate(nantian_gateway_dataplane_http_responses_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway data plane error rate is high"
          description: "5xx error rate is above 5% ({{ $value | humanizePercentage }}). Check backend health and data plane logs."
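
A ratio-based rule can fire on very small traffic volumes, where a handful of failed requests is enough to exceed 5%. One common mitigation is to require a minimum request rate as well. A minimal sketch, assuming a floor of 1 request per second; that threshold and the group and alert names are illustrative choices, not project recommendations:

```yaml
groups:
  - name: nantian-dataplane-error-rate           # hypothetical group name
    rules:
      - alert: NantianHighErrorRateWithTraffic   # hypothetical alert name
        expr: |
          (
            sum(rate(nantian_gateway_dataplane_http_responses_total{status_code=~"5.."}[5m]))
            /
            sum(rate(nantian_gateway_dataplane_http_responses_total[5m]))
          ) > 0.05
          and
          # Only alert when at least 1 request/s is being served (assumed floor)
          sum(rate(nantian_gateway_dataplane_http_responses_total[5m])) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Nantian Gateway data plane error rate is high"
          description: "5xx error rate is {{ $value | humanizePercentage }} while serving more than 1 request per second."
```

Both sides of `and` are aggregated to an empty label set, so they match each other and `$value` remains the error ratio from the left-hand side.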

High latency

Alert when P99 request latency, aggregated across all data plane instances, exceeds 2 seconds.

      - alert: NantianHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(nantian_gateway_dataplane_http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway data plane latency is high"
          description: "P99 request latency is above 2 seconds ({{ $value }}s). Check backend response times and data plane resource usage."

Backend health degradation

Alert when a backend endpoint has been reported unhealthy for 2 minutes. The rule fires once per unhealthy endpoint.

      - alert: NantianBackendUnhealthy
        expr: nantian_gateway_dataplane_backend_health_status == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway backend endpoint is unhealthy"
          description: "Backend {{ $labels.backend }} endpoint {{ $labels.endpoint }} is unhealthy. Check the backend service status."
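
Because the rule above fires per endpoint, a backend with many endpoints can produce a burst of warnings. A complementary rule can alert on the share of unhealthy endpoints per backend. The sketch below assumes the metric carries the `backend` label used in the description above and reports 0 for unhealthy endpoints; the 50% threshold, the severity, and the names are hypothetical:

```yaml
groups:
  - name: nantian-dataplane-backend-health        # hypothetical group name
    rules:
      - alert: NantianBackendMostlyUnhealthy      # hypothetical alert name
        # Unhealthy endpoints divided by all endpoints, per backend
        expr: |
          count by (backend) (nantian_gateway_dataplane_backend_health_status == 0)
          /
          count by (backend) (nantian_gateway_dataplane_backend_health_status)
          > 0.5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Most endpoints of a Nantian Gateway backend are unhealthy"
          description: "{{ $value | humanizePercentage }} of the endpoints of backend {{ $labels.backend }} are unhealthy."
```

Backends with no unhealthy endpoints produce no result on the left-hand side, so they never match.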

CPU throttling

Alert when a data plane container has a CPU throttle ratio above 10% for 10 minutes.

      - alert: NantianDataplaneCPUThrottling
        expr: nantian_gateway_dataplane_container_cpu_throttle_ratio > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nantian Gateway data plane CPU throttling"
          description: "Data plane {{ $labels.pod }} CPU throttle ratio is {{ $value | humanizePercentage }}. Consider increasing CPU limits."

Deploying alerting rules

Prometheus Operator (PrometheusRule)

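Create a PrometheusRule resource and save it as nantian-alerting-rules.yaml. Adjust the namespace and the release label so that they match the rule selector of your Prometheus instance: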

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nantian-gateway
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: nantian-controlplane
      rules:
        # ... control plane rules from above
    - name: nantian-dataplane
      rules:
        # ... data plane rules from above
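
Rules placed under `spec.groups` sit two levels deeper than in a plain rule file, which is an easy place to make an indentation mistake. As an illustration, here is the same resource with a single data plane rule from this page filled in:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nantian-gateway
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # example value; must match your Prometheus ruleSelector
spec:
  groups:
    - name: nantian-dataplane
      rules:
        # Copied unchanged from the "No ready data planes" rule above
        - alert: NantianNoReadyDataplanes
          expr: nantian_gateway_dataplane_ready_replicas == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No Nantian Gateway data planes are ready"
            description: "Zero data plane instances are ready to serve traffic. All incoming requests will fail. Check data plane deployment status immediately."
```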

Apply it:

kubectl apply -f nantian-alerting-rules.yaml

Native Prometheus

Save each rule group to its own file, reference the files under rule_files in your Prometheus configuration, and reload Prometheus:

rule_files:
  - /etc/prometheus/rules/nantian-controlplane.yaml
  - /etc/prometheus/rules/nantian-dataplane.yaml
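
Before loading the files, the rules can be exercised with the unit test support in `promtool`. The following is a sketch of a test file, assuming the control plane group has been saved as nantian-controlplane.yaml in the same directory; the test file name and the `instance="cp-0"` label are hypothetical:

```yaml
# nantian-controlplane_test.yaml (hypothetical file name)
rule_files:
  - nantian-controlplane.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # The build-success gauge stays at 0 for 10 minutes
      - series: 'nantian_gateway_snapshot_last_build_success{instance="cp-0"}'
        values: '0x10'
    alert_rule_test:
      # The rule uses "for: 5m", so it should be firing by minute 6
      - eval_time: 6m
        alertname: NantianSnapshotBuildFailing
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: cp-0
            exp_annotations:
              summary: "Nantian Gateway snapshot build is failing"
              description: "The control plane translator has been unable to build a valid snapshot for 5 minutes. Data planes are serving the last known configuration. Check control plane logs for translation errors."
```

Run it with `promtool test rules nantian-controlplane_test.yaml`. To check syntax only, use `promtool check rules` on the rule file.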

Alert severity guidelines

Severity	When to use
`critical`	Traffic is affected. Wake someone up.
`warning`	Something is degraded but traffic is still flowing. Investigate during business hours.

Notification routing

Configure Alertmanager to route alerts to the appropriate channels. Example:

route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - severity="warning"
      receiver: slack
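
The routing fragment above refers to receivers that must be defined elsewhere in the Alertmanager configuration. A fuller sketch might look like the following. It uses the `matchers` syntax available in Alertmanager 0.22 and later; the integration key, webhook URL, and channel name are placeholders, and the inhibit rule is one possible choice rather than a project recommendation:

```yaml
route:
  receiver: default
  group_by: [alertname]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - severity="warning"
      receiver: slack

receivers:
  - name: default                  # no integrations: unmatched alerts are not sent anywhere
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"     # placeholder
  - name: slack
    slack_configs:
      - api_url: "<slack-incoming-webhook-url>"        # placeholder
        channel: "#nantian-alerts"                     # hypothetical channel

inhibit_rules:
  # When no data planes are ready, suppress the broader "not ready" alert
  - source_matchers:
      - alertname="NantianNoReadyDataplanes"
    target_matchers:
      - alertname="NantianDataplaneNotReady"
```

Inhibiting the less specific alert keeps a full outage from paging twice for the same underlying problem.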

What’s next

Troubleshooting — what to do when an alert fires
Metrics Reference — complete catalog of all metrics used in alerting rules
Backup & Recovery — procedures for disaster recovery