Metrics Reference

This page catalogs every Prometheus metric emitted by Nantian Gateway. Both the control plane and the data plane expose metrics endpoints: control plane metrics are served on :18082 by default, and data plane metrics are served through the data plane’s admin API.

All control plane metrics use the nantian_gateway prefix. Data plane metrics use the nantian_gateway_dataplane prefix.

Metrics tracking the translator’s snapshot build process.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_snapshot_builds_total | Counter | | Total number of snapshot rebuild attempts |
| nantian_gateway_snapshot_build_failures_total | Counter | | Total number of failed snapshot rebuilds |
| nantian_gateway_snapshot_published_total | Counter | | Total number of successfully published snapshots |
| nantian_gateway_snapshot_last_build_success | Gauge | | 1 if last build succeeded, 0 otherwise |
| nantian_gateway_snapshot_build_duration_seconds | Histogram | | Duration of snapshot build attempts, including failures |
| nantian_gateway_snapshot_resource_count | Histogram | resource | Resource counts in successfully built snapshots, partitioned by resource type |
| nantian_gateway_snapshot_listener_attached_routes | Histogram | | Attached route fanout per listener across successfully built snapshots |

The resource label on snapshot_resource_count takes values such as Gateway, HTTPRoute, GRPCRoute, TCPRoute, Service, EndpointSlice, Secret, AIService, TokenPolicy, and WasmPlugin.
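
The snapshot metrics above could be turned into rules along the following lines. This is a sketch, not a rule set shipped with the project: the group name, alert name, `example:` record names, time windows, and the 5m hold time are all assumptions, and the `_bucket` suffix assumes the histograms are exposed as classic Prometheus histograms.

```yaml
groups:
  - name: nantian-gateway-snapshot-examples   # hypothetical group name
    rules:
      # Fires when the most recent snapshot build has been failing for 5 minutes.
      - alert: NantianSnapshotBuildFailing     # hypothetical alert name
        expr: nantian_gateway_snapshot_last_build_success == 0
        for: 5m
        labels:
          severity: warning
      # Fraction of snapshot build attempts that failed over the last 15 minutes.
      - record: example:nantian_gateway_snapshot_build_failure_ratio
        expr: |
          sum(rate(nantian_gateway_snapshot_build_failures_total[15m]))
          /
          sum(rate(nantian_gateway_snapshot_builds_total[15m]))
      # 95th percentile build duration (assumes classic histogram buckets).
      - record: example:nantian_gateway_snapshot_build_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(nantian_gateway_snapshot_build_duration_seconds_bucket[15m])))
```

The failure ratio returns no data while no builds are attempted, because both rates are zero; pair it with the last_build_success gauge if that case matters.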

Metrics for the HTTP admin API server.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_controlplane_admin_requests_total | Counter | method, route, status_class | Total admin API requests partitioned by HTTP method, normalized route, and status class |
| nantian_gateway_controlplane_admin_request_duration_seconds | Histogram | method, route, status_class | Duration of admin API requests |

The status_class label is derived from the HTTP status code: 2xx, 3xx, 4xx, 5xx.
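
As an illustration of how the status_class and route labels can be used, the following recording rules compute an error ratio and a per-route latency percentile for the admin API. The group name, `example:` record names, and 5m window are assumptions, and the `_bucket` series assumes classic Prometheus histograms.

```yaml
groups:
  - name: nantian-gateway-admin-api-examples   # hypothetical group name
    rules:
      # Share of admin API requests that ended in a 5xx status class.
      - record: example:nantian_gateway_controlplane_admin_5xx_ratio
        expr: |
          sum(rate(nantian_gateway_controlplane_admin_requests_total{status_class="5xx"}[5m]))
          /
          sum(rate(nantian_gateway_controlplane_admin_requests_total[5m]))
      # 99th percentile admin API request duration per normalized route.
      - record: example:nantian_gateway_controlplane_admin_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, route) (
              rate(nantian_gateway_controlplane_admin_request_duration_seconds_bucket[5m])))
```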

Metrics for the gRPC xDS server and data plane communication.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_controlplane_xds_snapshot_fanout_coalesced_total | Counter | | Per-subscriber pending snapshots replaced by newer published snapshots because a data plane stream was not keeping up |
| nantian_gateway_controlplane_xds_stream_terminations_total | Counter | reason | xDS stream terminations partitioned by reason |
| nantian_gateway_controlplane_xds_status_report_rejections_total | Counter | reason | Data plane status reports rejected before mutating node state |
| nantian_gateway_controlplane_xds_snapshot_send_duration_seconds | Histogram | | Duration of sending a snapshot to a single data plane stream |
| nantian_gateway_controlplane_xds_snapshot_send_timeouts_total | Counter | | Data plane streams disconnected because snapshot sending timed out |
| nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total | Counter | | Data plane streams disconnected because no ACK/NACK arrived for the latest snapshot |
| nantian_gateway_controlplane_xds_publish_ack_lag_seconds | Histogram | | Latency between publishing a snapshot and receiving the matching ACK |
| nantian_gateway_controlplane_xds_publish_nack_lag_seconds | Histogram | | Latency between publishing a snapshot and receiving the matching NACK |

Stream termination reasons in the reason label: shutdown, client_disconnect, stream_error, send_timeout, ack_timeout, superseded, invalid_request, other.

Status report rejection reasons: shutdown, invalid_request, unknown_node, other.
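
The reason label makes it possible to separate routine terminations (for example shutdown) from ones that suggest a struggling data plane. A minimal sketch, assuming hypothetical rule names, arbitrary windows, and classic histogram buckets:

```yaml
groups:
  - name: nantian-gateway-xds-examples   # hypothetical group name
    rules:
      # Termination rate broken down by the reason label.
      - record: example:nantian_gateway_controlplane_xds_stream_terminations:rate5m
        expr: |
          sum by (reason) (
            rate(nantian_gateway_controlplane_xds_stream_terminations_total[5m]))
      # Fires if any stream was disconnected for a missing ACK/NACK in the last 10 minutes.
      - alert: NantianXdsAckTimeouts     # hypothetical alert name
        expr: increase(nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total[10m]) > 0
        labels:
          severity: warning
      # 95th percentile latency between publishing a snapshot and its ACK.
      - record: example:nantian_gateway_controlplane_xds_publish_ack_lag_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le) (
              rate(nantian_gateway_controlplane_xds_publish_ack_lag_seconds_bucket[5m])))
```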

Metrics for the node status persistence system, which stores data plane node status in Kubernetes Lease objects.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_controlplane_node_status_persist_queue_depth | Gauge | | Current number of distinct node status updates waiting to enter the persistence worker |
| nantian_gateway_controlplane_node_status_persist_pending_nodes | Gauge | | Number of distinct node status updates waiting in the debounce window |
| nantian_gateway_controlplane_node_status_persist_enqueued_total | Counter | | Total node status updates accepted into the bounded backlog |
| nantian_gateway_controlplane_node_status_persist_dropped_total | Counter | | Node status updates dropped because the bounded backlog was full |
| nantian_gateway_controlplane_node_status_persist_immediate_total | Counter | | Immediate node status updates accepted into the backlog |
| nantian_gateway_controlplane_node_status_persist_debounced_totalCounter | Counter | | Debounced node status updates accepted into the backlog |
| nantian_gateway_controlplane_node_status_persist_flush_duration_seconds | Histogram | | Duration of flushing debounced node status persistence batches |
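
Because the backlog is bounded, dropped updates are the clearest sign that persistence is falling behind. The sketch below assumes that enqueued_total counts only accepted updates, as its description states, so that dropped plus enqueued approximates all submitted updates. Rule names and windows are hypothetical.

```yaml
groups:
  - name: nantian-gateway-node-status-examples   # hypothetical group name
    rules:
      # Fires if any node status update was dropped in the last 10 minutes.
      - alert: NantianNodeStatusUpdatesDropped     # hypothetical alert name
        expr: increase(nantian_gateway_controlplane_node_status_persist_dropped_total[10m]) > 0
        labels:
          severity: warning
      # Approximate share of submitted updates that were dropped
      # (assumes submitted = dropped + enqueued).
      - record: example:nantian_gateway_controlplane_node_status_persist_drop_ratio
        expr: |
          sum(rate(nantian_gateway_controlplane_node_status_persist_dropped_total[10m]))
          /
          (
            sum(rate(nantian_gateway_controlplane_node_status_persist_dropped_total[10m]))
            +
            sum(rate(nantian_gateway_controlplane_node_status_persist_enqueued_total[10m]))
          )
```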

Metrics for the custom reconciler runner that schedules infrastructure and status reconciliation.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_controlplane_reconciler_runner_runs_total | Counter | | Total custom reconciler runner executions |
| nantian_gateway_controlplane_reconciler_runner_failures_total | Counter | | Failed reconciler runner executions |
| nantian_gateway_controlplane_reconciler_runner_last_run_success | Gauge | | 1 if the last reconciler execution succeeded, 0 otherwise |
| nantian_gateway_controlplane_reconciler_runner_duration_seconds | Histogram | scope | Duration of reconciler executions partitioned by scope |
| nantian_gateway_controlplane_reconciler_runner_queue_depth | Gauge | | Current queue depth for reconciler triggers |
| nantian_gateway_controlplane_reconciler_runner_triggers_enqueued_total | Counter | | Triggers accepted into the queue |
| nantian_gateway_controlplane_reconciler_runner_triggers_deduplicated_total | Counter | | Triggers dropped because the queue was already full |
| nantian_gateway_controlplane_reconciler_runner_triggers_settled_total | Counter | | Triggers routed through the settle window |
| nantian_gateway_controlplane_reconciler_runner_settle_pending | Gauge | | 1 if a delayed settle trigger is pending, 0 otherwise |
| nantian_gateway_controlplane_reconciler_runner_retries_scheduled_total | Counter | | Failure-triggered retry runs scheduled |
| nantian_gateway_controlplane_reconciler_runner_retry_pending | Gauge | | 1 if a failure-triggered retry is pending, 0 otherwise |

The scope label on reconciler_runner_duration_seconds can be infra, status, gateway_status, route_status, or policy_status.
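
To show how the scope label and the success gauge might be combined, here is a small sketch. The alert name, record name, 10m windows, and severity are assumptions chosen for illustration, and the `_bucket` series assumes classic Prometheus histograms.

```yaml
groups:
  - name: nantian-gateway-reconciler-examples   # hypothetical group name
    rules:
      # Fires when the last reconciler run has been unsuccessful for 10 minutes.
      - alert: NantianReconcilerFailing           # hypothetical alert name
        expr: nantian_gateway_controlplane_reconciler_runner_last_run_success == 0
        for: 10m
        labels:
          severity: warning
      # 95th percentile reconciler duration per scope (infra, status, ...).
      - record: example:nantian_gateway_controlplane_reconciler_runner_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (le, scope) (
              rate(nantian_gateway_controlplane_reconciler_runner_duration_seconds_bucket[10m])))
```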

Metrics for data plane readiness and runtime supervisor state.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_dataplane_ready | Gauge | namespace, pod, job | 1 if the data plane is ready to serve traffic |
| nantian_gateway_dataplane_runtime_supervisor_http_states | Gauge | namespace, pod, job | HTTP supervisor state (1 = active) |

Metrics for HTTP traffic handled by the data plane.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_dataplane_http_requests_total | Counter | listener, route, status_class | HTTP requests processed by listener and route |
| nantian_gateway_dataplane_http_request_duration_seconds | Histogram | listener, route | Request duration from arrival to response |
| nantian_gateway_dataplane_http_responses_total | Counter | listener, route, status_code | HTTP responses sent |

Metrics for backend connections and endpoint health.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_dataplane_backend_connections_active | Gauge | backend | Active connections to backend services |
| nantian_gateway_dataplane_backend_connections_total | Counter | backend | Total connections established to backends |
| nantian_gateway_dataplane_backend_connection_errors_total | Counter | backend, reason | Connection errors by backend and reason |
| nantian_gateway_dataplane_backend_health_status | Gauge | backend, endpoint | 1 if backend endpoint is healthy, 0 otherwise |

Metrics for the data plane admin API.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| nantian_gateway_dataplane_admin_requests_total | Counter | method, route, status_class | Data plane admin API requests |
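
The data plane metrics above lend themselves to per-route and per-backend rules such as the following. This sketch assumes that the data plane status_class label uses the same 2xx/3xx/4xx/5xx values described for the control plane. Rule names, windows, and hold times are hypothetical.

```yaml
groups:
  - name: nantian-gateway-dataplane-examples   # hypothetical group name
    rules:
      # Fires when a data plane pod has reported not-ready for 2 minutes.
      - alert: NantianDataplaneNotReady          # hypothetical alert name
        expr: nantian_gateway_dataplane_ready == 0
        for: 2m
        labels:
          severity: critical
      # 5xx share per listener and route (assumes status_class="5xx" exists
      # on the data plane, mirroring the control plane convention).
      - record: example:nantian_gateway_dataplane_http_5xx_ratio
        expr: |
          sum by (listener, route) (
            rate(nantian_gateway_dataplane_http_requests_total{status_class="5xx"}[5m]))
          /
          sum by (listener, route) (
            rate(nantian_gateway_dataplane_http_requests_total[5m]))
      # Number of unhealthy endpoints per backend.
      - record: example:nantian_gateway_dataplane_backend_unhealthy_endpoints
        expr: count by (backend) (nantian_gateway_dataplane_backend_health_status == 0)
```

The unhealthy endpoint count returns no series for a backend whose endpoints are all healthy, so an absent result means zero.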

The control plane metrics endpoint is at :18082/metrics. Add this to your Prometheus scrape configuration:

scrape_configs:
  - job_name: nantian-controlplane
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - nantian-gw
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: nantian-controlplane
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "18082"
        action: keep

Data plane metrics are served through the data plane’s admin API. If data plane aggregation is enabled, the control plane can proxy metrics requests. Otherwise, scrape each data plane pod directly:

scrape_configs:
  - job_name: nantian-dataplane
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - nantian-gw
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: nantian-dataplane
        action: keep

ServiceMonitors and PodMonitors for the Prometheus Operator are available in deploy/observability/prometheus/operator/.

The project ships recording rules in deploy/observability/prometheus/native/prometheus-dataplane-rules.yaml and the equivalent Prometheus Operator format in deploy/observability/prometheus/operator/prometheusrule-dataplane.yaml. These rules precompute frequently used aggregations:

| Rule | Description |
| --- | --- |
| nantian_gateway_dataplane_ready_replicas | Sum of ready data plane instances |
| nantian_gateway_dataplane_targets | Count of data plane instances |
| nantian_gateway_dataplane_not_ready_replicas | Count of not-ready instances |
| nantian_gateway_dataplane_container_cpu_cores | CPU usage per container (requires cAdvisor) |
| nantian_gateway_dataplane_container_cpu_request_cores | CPU request per container (requires kube-state-metrics) |
| nantian_gateway_dataplane_container_cpu_throttle_ratio | CPU throttle ratio (requires cAdvisor) |
| nantian_gateway_dataplane_container_memory_working_set_bytes | Memory working set (requires cAdvisor) |
| nantian_gateway_dataplane_container_memory_limit_bytes | Memory limit (requires kube-state-metrics) |
| nantian_gateway_dataplane_container_memory_request_bytes | Memory request (requires kube-state-metrics) |
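
Once the recording rules are loaded, alerts can be written against the precomputed series instead of the raw metrics. The following is a sketch only: the alert names and thresholds are invented for illustration, the label set on each recorded series is not documented here, and the throttle ratio is assumed to be a fraction between 0 and 1.

```yaml
groups:
  - name: nantian-gateway-dataplane-alert-examples   # hypothetical group name
    rules:
      # Fires when at least one data plane instance has been not ready for 5 minutes.
      - alert: NantianDataplaneReplicasNotReady        # hypothetical alert name
        expr: nantian_gateway_dataplane_not_ready_replicas > 0
        for: 5m
        labels:
          severity: warning
      # Fires on sustained CPU throttling (assumes the ratio is in the range 0..1;
      # the 0.25 threshold is arbitrary).
      - alert: NantianDataplaneCpuThrottled            # hypothetical alert name
        expr: nantian_gateway_dataplane_container_cpu_throttle_ratio > 0.25
        for: 10m
        labels:
          severity: warning
```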