Metrics Reference
This page catalogs every Prometheus metric emitted by Nantian Gateway. Both the control plane and the data plane expose metrics endpoints: the control plane serves metrics on :18082 by default, and the data plane serves them through its admin API.
All control plane metrics use the nantian_gateway prefix. Data plane metrics use the nantian_gateway_dataplane prefix.
Control plane metrics
Snapshot translation
Metrics tracking the translator’s snapshot build process.
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_snapshot_builds_total | Counter | | Total number of snapshot rebuild attempts |
| nantian_gateway_snapshot_build_failures_total | Counter | | Total number of failed snapshot rebuilds |
| nantian_gateway_snapshot_published_total | Counter | | Total number of successfully published snapshots |
| nantian_gateway_snapshot_last_build_success | Gauge | | 1 if last build succeeded, 0 otherwise |
| nantian_gateway_snapshot_build_duration_seconds | Histogram | | Duration of snapshot build attempts, including failures |
| nantian_gateway_snapshot_resource_count | Histogram | resource | Resource counts in successfully built snapshots, partitioned by resource type |
| nantian_gateway_snapshot_listener_attached_routes | Histogram | | Attached route fanout per listener across successfully built snapshots |
The resource label on nantian_gateway_snapshot_resource_count takes values such as Gateway, HTTPRoute, GRPCRoute, TCPRoute, Service, EndpointSlice, Secret, AIService, TokenPolicy, and WasmPlugin.
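The counters and histogram above combine naturally into a failure ratio and a latency percentile. The following is a minimal sketch of a Prometheus rule file, not a shipped artifact: the group name, the `example:` record names, and the 5m window are hypothetical, and the second rule assumes the histogram is exposed in classic form with `_bucket` series.

```yaml
groups:
  - name: nantian-snapshot-examples   # hypothetical group name
    rules:
      # Fraction of snapshot rebuild attempts that failed over the last 5 minutes.
      - record: example:nantian_gateway_snapshot_build_failure_ratio:rate5m   # hypothetical name
        expr: |
          sum(rate(nantian_gateway_snapshot_build_failures_total[5m]))
          /
          sum(rate(nantian_gateway_snapshot_builds_total[5m]))
      # 99th percentile snapshot build duration (assumes classic histogram buckets).
      - record: example:nantian_gateway_snapshot_build_duration_seconds:p99_5m   # hypothetical name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(nantian_gateway_snapshot_build_duration_seconds_bucket[5m]))
          )
```

The ratio yields no sample while no builds occur in the window, because the denominator is then zero.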
Admin API
Metrics for the HTTP admin API server.
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_controlplane_admin_requests_total | Counter | method, route, status_class | Total admin API requests partitioned by HTTP method, normalized route, and status class |
| nantian_gateway_controlplane_admin_request_duration_seconds | Histogram | method, route, status_class | Duration of admin API requests |
The status_class label is derived from the HTTP status code: 2xx, 3xx, 4xx, 5xx.
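Because status_class collapses status codes into four values, an error ratio per route needs only one label matcher. A hedged sketch, assuming the label value is the literal string `5xx` as listed above; the group and record names are hypothetical.

```yaml
groups:
  - name: nantian-admin-api-examples   # hypothetical group name
    rules:
      # Share of admin API requests per route that ended in a 5xx status class.
      - record: example:nantian_gateway_controlplane_admin_5xx_ratio:rate5m   # hypothetical name
        expr: |
          sum by (route) (rate(nantian_gateway_controlplane_admin_requests_total{status_class="5xx"}[5m]))
          /
          sum by (route) (rate(nantian_gateway_controlplane_admin_requests_total[5m]))
```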
xDS streaming
Metrics for the gRPC xDS server and data plane communication.
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_controlplane_xds_snapshot_fanout_coalesced_total | Counter | | Per-subscriber pending snapshots replaced by newer published snapshots because a data plane stream was not keeping up |
| nantian_gateway_controlplane_xds_stream_terminations_total | Counter | reason | xDS stream terminations partitioned by reason |
| nantian_gateway_controlplane_xds_status_report_rejections_total | Counter | reason | Data plane status reports rejected before mutating node state |
| nantian_gateway_controlplane_xds_snapshot_send_duration_seconds | Histogram | | Duration of sending a snapshot to a single data plane stream |
| nantian_gateway_controlplane_xds_snapshot_send_timeouts_total | Counter | | Data plane streams disconnected because snapshot sending timed out |
| nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total | Counter | | Data plane streams disconnected because no ACK/NACK arrived for the latest snapshot |
| nantian_gateway_controlplane_xds_publish_ack_lag_seconds | Histogram | | Latency between publishing a snapshot and receiving the matching ACK |
| nantian_gateway_controlplane_xds_publish_nack_lag_seconds | Histogram | | Latency between publishing a snapshot and receiving the matching NACK |
Stream termination reasons in the reason label: shutdown, client_disconnect, stream_error, send_timeout, ack_timeout, superseded, invalid_request, other.
Status report rejection reasons: shutdown, invalid_request, unknown_node, other.
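The reason labels make it possible to separate routine disconnects from timeouts. The rules below are an illustrative sketch only: the group and record names are hypothetical, and the percentile rule assumes classic histogram buckets.

```yaml
groups:
  - name: nantian-xds-examples   # hypothetical group name
    rules:
      # Per-second rate of xDS stream terminations, broken down by reason.
      - record: example:nantian_gateway_controlplane_xds_stream_terminations:rate5m   # hypothetical name
        expr: |
          sum by (reason) (rate(nantian_gateway_controlplane_xds_stream_terminations_total[5m]))
      # 95th percentile delay between publishing a snapshot and receiving its ACK.
      - record: example:nantian_gateway_controlplane_xds_publish_ack_lag_seconds:p95_5m   # hypothetical name
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(nantian_gateway_controlplane_xds_publish_ack_lag_seconds_bucket[5m]))
          )
```

Filtering the first series on `reason=~"send_timeout|ack_timeout"` isolates terminations caused by slow data planes.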
Node status persistence
Metrics for the node status persistence system, which stores data plane node status in Kubernetes Lease objects.
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_controlplane_node_status_persist_queue_depth | Gauge | | Current number of distinct node status updates waiting to enter the persistence worker |
| nantian_gateway_controlplane_node_status_persist_pending_nodes | Gauge | | Number of distinct node status updates waiting in the debounce window |
| nantian_gateway_controlplane_node_status_persist_enqueued_total | Counter | | Total node status updates accepted into the bounded backlog |
| nantian_gateway_controlplane_node_status_persist_dropped_total | Counter | | Node status updates dropped because the bounded backlog was full |
| nantian_gateway_controlplane_node_status_persist_immediate_total | Counter | | Immediate node status updates accepted into the backlog |
| nantian_gateway_controlplane_node_status_persist_debounced_total | Counter | | Debounced node status updates accepted into the backlog |
| nantian_gateway_controlplane_node_status_persist_flush_duration_seconds | Histogram | | Duration of flushing debounced node status persistence batches |
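Any increase in the dropped counter means the bounded backlog overflowed, so it is a reasonable signal to watch. A minimal alerting sketch under stated assumptions: the group name, alert name, severity label, and 10m window are all hypothetical choices, not project defaults.

```yaml
groups:
  - name: nantian-node-status-examples   # hypothetical group name
    rules:
      - alert: NantianNodeStatusUpdatesDropped   # hypothetical alert name
        # True when at least one update was dropped in the last 10 minutes (illustrative window).
        expr: increase(nantian_gateway_controlplane_node_status_persist_dropped_total[10m]) > 0
        labels:
          severity: warning   # illustrative severity
        annotations:
          summary: Node status updates were dropped because the persistence backlog was full
```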
Reconciler runner
Metrics for the custom reconciler runner that schedules infrastructure and status reconciliation.
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_controlplane_reconciler_runner_runs_total | Counter | | Total custom reconciler runner executions |
| nantian_gateway_controlplane_reconciler_runner_failures_total | Counter | | Failed reconciler runner executions |
| nantian_gateway_controlplane_reconciler_runner_last_run_success | Gauge | | 1 if the last reconciler execution succeeded, 0 otherwise |
| nantian_gateway_controlplane_reconciler_runner_duration_seconds | Histogram | scope | Duration of reconciler executions partitioned by scope |
| nantian_gateway_controlplane_reconciler_runner_queue_depth | Gauge | | Current queue depth for reconciler triggers |
| nantian_gateway_controlplane_reconciler_runner_triggers_enqueued_total | Counter | | Triggers accepted into the queue |
| nantian_gateway_controlplane_reconciler_runner_triggers_deduplicated_total | Counter | | Triggers dropped because the queue was already full |
| nantian_gateway_controlplane_reconciler_runner_triggers_settled_total | Counter | | Triggers routed through the settle window |
| nantian_gateway_controlplane_reconciler_runner_settle_pending | Gauge | | 1 if a delayed settle trigger is pending, 0 otherwise |
| nantian_gateway_controlplane_reconciler_runner_retries_scheduled_total | Counter | | Failure-triggered retry runs scheduled |
| nantian_gateway_controlplane_reconciler_runner_retry_pending | Gauge | | 1 if a failure-triggered retry is pending, 0 otherwise |
The scope label on reconciler_runner_duration_seconds can be infra, status, gateway_status, route_status, or policy_status.
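Keeping scope in the aggregation shows which kind of reconciliation is slow. A hedged sketch with hypothetical group and record names, assuming classic histogram buckets:

```yaml
groups:
  - name: nantian-reconciler-examples   # hypothetical group name
    rules:
      # 99th percentile reconciler execution time, one series per scope value.
      - record: example:nantian_gateway_controlplane_reconciler_runner_duration_seconds:p99_5m   # hypothetical name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, scope) (rate(nantian_gateway_controlplane_reconciler_runner_duration_seconds_bucket[5m]))
          )
```

The `le` label must stay in the `by` clause because `histogram_quantile` needs the bucket boundaries to interpolate.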
Data plane metrics
Runtime
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_dataplane_ready | Gauge | namespace, pod, job | 1 if the data plane is ready to serve traffic |
| nantian_gateway_dataplane_runtime_supervisor_http_states | Gauge | namespace, pod, job | HTTP supervisor state (1 = active) |
Request processing
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_dataplane_http_requests_total | Counter | listener, route, status_class | HTTP requests processed by listener and route |
| nantian_gateway_dataplane_http_request_duration_seconds | Histogram | listener, route | Request duration from arrival to response |
| nantian_gateway_dataplane_http_responses_total | Counter | listener, route, status_code | HTTP responses sent |
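These request metrics support the usual traffic views: throughput, error ratio, and latency per route. The rules below are a sketch rather than shipped rules; the names are hypothetical, `5xx` is assumed to be the status_class value for server errors, and the latency rule assumes classic histogram buckets.

```yaml
groups:
  - name: nantian-dataplane-http-examples   # hypothetical group name
    rules:
      # Requests per second for each listener and route.
      - record: example:nantian_gateway_dataplane_http_requests:rate5m   # hypothetical name
        expr: |
          sum by (listener, route) (rate(nantian_gateway_dataplane_http_requests_total[5m]))
      # Share of requests per listener and route that ended in a 5xx status class.
      - record: example:nantian_gateway_dataplane_http_5xx_ratio:rate5m   # hypothetical name
        expr: |
          sum by (listener, route) (rate(nantian_gateway_dataplane_http_requests_total{status_class="5xx"}[5m]))
          /
          sum by (listener, route) (rate(nantian_gateway_dataplane_http_requests_total[5m]))
      # 99th percentile request duration per route.
      - record: example:nantian_gateway_dataplane_http_request_duration_seconds:p99_5m   # hypothetical name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, route) (rate(nantian_gateway_dataplane_http_request_duration_seconds_bucket[5m]))
          )
```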
Backend connections
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_dataplane_backend_connections_active | Gauge | backend | Active connections to backend services |
| nantian_gateway_dataplane_backend_connections_total | Counter | backend | Total connections established to backends |
| nantian_gateway_dataplane_backend_connection_errors_total | Counter | backend, reason | Connection errors by backend and reason |
| nantian_gateway_dataplane_backend_health_status | Gauge | backend, endpoint | 1 if backend endpoint is healthy, 0 otherwise |
Admin API
| Metric | Type | Labels | Description |
|---|---|---|---|
| nantian_gateway_dataplane_admin_requests_total | Counter | method, route, status_class | Data plane admin API requests |
Scraping configuration
Control plane scrape config
The control plane metrics endpoint is at :18082/metrics. Add this to your Prometheus scrape configuration:

```yaml
scrape_configs:
  - job_name: nantian-controlplane
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - nantian-gw
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: nantian-controlplane
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "18082"
        action: keep
```

Data plane scrape config
The data plane metrics are served through its admin API. If using data plane aggregation, the control plane can proxy metrics requests. Otherwise, scrape each data plane pod directly:

```yaml
scrape_configs:
  - job_name: nantian-dataplane
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - nantian-gw
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: nantian-dataplane
        action: keep
```

ServiceMonitors and PodMonitors for the Prometheus Operator are available in deploy/observability/prometheus/operator/.
Recording rules
The project ships recording rules in deploy/observability/prometheus/native/prometheus-dataplane-rules.yaml and the equivalent Prometheus Operator format in deploy/observability/prometheus/operator/prometheusrule-dataplane.yaml. These rules precompute frequently used aggregations:
| Rule | Description |
|---|---|
| nantian_gateway_dataplane_ready_replicas | Sum of ready data plane instances |
| nantian_gateway_dataplane_targets | Count of data plane instances |
| nantian_gateway_dataplane_not_ready_replicas | Count of not-ready instances |
| nantian_gateway_dataplane_container_cpu_cores | CPU usage per container (requires cAdvisor) |
| nantian_gateway_dataplane_container_cpu_request_cores | CPU request per container (requires kube-state-metrics) |
| nantian_gateway_dataplane_container_cpu_throttle_ratio | CPU throttle ratio (requires cAdvisor) |
| nantian_gateway_dataplane_container_memory_working_set_bytes | Memory working set (requires cAdvisor) |
| nantian_gateway_dataplane_container_memory_limit_bytes | Memory limit (requires kube-state-metrics) |
| nantian_gateway_dataplane_container_memory_request_bytes | Memory request (requires kube-state-metrics) |
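For orientation, the first three rules could plausibly be derived from the nantian_gateway_dataplane_ready gauge as sketched below. These expressions are assumptions inferred from the descriptions in the table, not copies of the shipped rules; the files under deploy/observability/prometheus/ remain the authority, and the group name and the choice of grouping labels here are hypothetical.

```yaml
groups:
  - name: nantian-dataplane-replica-sketch   # hypothetical group name
    rules:
      # Ready instances: the gauge is 1 per ready pod, so summing counts them.
      - record: nantian_gateway_dataplane_ready_replicas
        expr: sum by (namespace, job) (nantian_gateway_dataplane_ready)
      # All scraped instances, ready or not.
      - record: nantian_gateway_dataplane_targets
        expr: count by (namespace, job) (nantian_gateway_dataplane_ready)
      # Instances currently reporting 0 on the ready gauge.
      - record: nantian_gateway_dataplane_not_ready_replicas
        expr: count by (namespace, job) (nantian_gateway_dataplane_ready == 0)
```

Written this way, the last expression produces no series when every instance is ready, so a dashboard built on it would need to treat an absent value as zero.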
What’s next
- Grafana Dashboard: import and customize the pre-built dashboard
- Alerting Rules: alerting rules you should configure
- Configuration: Observability (logging, metrics, and tracing configuration)