Metrics Reference

This page catalogs every Prometheus metric emitted by Nantian Gateway. Both the control plane and the data plane expose metrics endpoints: the control plane serves metrics on :18082 by default, and the data plane serves them through its admin API.

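All control plane metrics start with the nantian_gateway prefix: snapshot translation metrics use nantian_gateway_snapshot, and the remaining control plane metrics use nantian_gateway_controlplane. Data plane metrics use the nantian_gateway_dataplane prefix.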

Control plane metrics

Snapshot translation

Metrics tracking the translator’s snapshot build process.

Metric	Type	Labels	Description
`nantian_gateway_snapshot_builds_total`	Counter		Total number of snapshot rebuild attempts
`nantian_gateway_snapshot_build_failures_total`	Counter		Total number of failed snapshot rebuilds
`nantian_gateway_snapshot_published_total`	Counter		Total number of successfully published snapshots
`nantian_gateway_snapshot_last_build_success`	Gauge		1 if last build succeeded, 0 otherwise
`nantian_gateway_snapshot_build_duration_seconds`	Histogram		Duration of snapshot build attempts, including failures
`nantian_gateway_snapshot_resource_count`	Histogram	`resource`	Resource counts in successfully built snapshots, partitioned by resource type
`nantian_gateway_snapshot_listener_attached_routes`	Histogram		Attached route fanout per listener across successfully built snapshots

The resource label on nantian_gateway_snapshot_resource_count takes values such as Gateway, HTTPRoute, GRPCRoute, TCPRoute, Service, EndpointSlice, Secret, AIService, TokenPolicy, and WasmPlugin.
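
As an illustration of how the snapshot counters and the duration histogram can be combined, the following recording rules sketch a build failure ratio and a p99 build duration. The group and rule names are hypothetical, and the `_bucket` suffix assumes the histogram is exposed as a classic Prometheus histogram.

```yaml
groups:
  - name: nantian-snapshot-examples   # hypothetical group name
    rules:
      # Fraction of snapshot builds that failed over the last 5 minutes.
      - record: nantian_gateway:snapshot_build_failure_ratio:rate5m   # hypothetical rule name
        expr: |
          sum(rate(nantian_gateway_snapshot_build_failures_total[5m]))
          /
          sum(rate(nantian_gateway_snapshot_builds_total[5m]))
      # 99th percentile build duration (assumes classic histogram buckets).
      - record: nantian_gateway:snapshot_build_duration_seconds:p99_5m   # hypothetical rule name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(nantian_gateway_snapshot_build_duration_seconds_bucket[5m]))
          )
```

In windows where no builds ran, the ratio evaluates to NaN rather than 0, so dashboards and alerts built on it should treat a missing or NaN value as "no data".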

Admin API

Metrics for the HTTP admin API server.

Metric	Type	Labels	Description
`nantian_gateway_controlplane_admin_requests_total`	Counter	`method`, `route`, `status_class`	Total admin API requests partitioned by HTTP method, normalized route, and status class
`nantian_gateway_controlplane_admin_request_duration_seconds`	Histogram	`method`, `route`, `status_class`	Duration of admin API requests

The status_class label is derived from the HTTP status code: 2xx, 3xx, 4xx, 5xx.
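
Because status_class collapses status codes into four values, an error ratio per route is a short query. The rule below is a sketch; the group and rule names are hypothetical.

```yaml
groups:
  - name: nantian-controlplane-admin-examples   # hypothetical group name
    rules:
      # Share of admin API requests per route that ended in a 5xx status class.
      - record: nantian_gateway_controlplane:admin_5xx_ratio:rate5m   # hypothetical rule name
        expr: |
          sum by (route) (rate(nantian_gateway_controlplane_admin_requests_total{status_class="5xx"}[5m]))
          /
          sum by (route) (rate(nantian_gateway_controlplane_admin_requests_total[5m]))
```

A route that has never returned a 5xx response has no matching numerator series, so it is absent from the result instead of reporting 0.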

xDS streaming

Metrics for the gRPC xDS server and data plane communication.

Metric	Type	Labels	Description
`nantian_gateway_controlplane_xds_snapshot_fanout_coalesced_total`	Counter		Per-subscriber pending snapshots replaced by newer published snapshots because a data plane stream was not keeping up
`nantian_gateway_controlplane_xds_stream_terminations_total`	Counter	`reason`	xDS stream terminations partitioned by reason
`nantian_gateway_controlplane_xds_status_report_rejections_total`	Counter	`reason`	Data plane status reports rejected before mutating node state
`nantian_gateway_controlplane_xds_snapshot_send_duration_seconds`	Histogram		Duration of sending a snapshot to a single data plane stream
`nantian_gateway_controlplane_xds_snapshot_send_timeouts_total`	Counter		Data plane streams disconnected because snapshot sending timed out
`nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total`	Counter		Data plane streams disconnected because no ACK/NACK arrived for the latest snapshot
`nantian_gateway_controlplane_xds_publish_ack_lag_seconds`	Histogram		Latency between publishing a snapshot and receiving the matching ACK
`nantian_gateway_controlplane_xds_publish_nack_lag_seconds`	Histogram		Latency between publishing a snapshot and receiving the matching NACK

Stream termination reasons in the reason label: shutdown, client_disconnect, stream_error, send_timeout, ack_timeout, superseded, invalid_request, other.

Status report rejection reasons: shutdown, invalid_request, unknown_node, other.
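
The timeout counters and the reason label lend themselves to simple alerts. The following alerting rules are a sketch only: the alert names, windows, severities, and the choice of which reasons count as abnormal are assumptions to adapt to your environment.

```yaml
groups:
  - name: nantian-xds-examples   # hypothetical group name
    rules:
      # Fires when any stream was disconnected for a missing ACK/NACK recently.
      - alert: NantianXdsAckTimeouts   # hypothetical alert name
        expr: increase(nantian_gateway_controlplane_xds_snapshot_ack_timeouts_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Data plane streams were disconnected because no ACK/NACK arrived"
      # Fires when error-like termination reasons persist for 10 minutes.
      - alert: NantianXdsAbnormalStreamTerminations   # hypothetical alert name
        expr: |
          sum by (reason) (
            rate(nantian_gateway_controlplane_xds_stream_terminations_total{reason=~"stream_error|send_timeout|ack_timeout"}[5m])
          ) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "xDS streams are terminating with reason {{ $labels.reason }}"
```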

Node status persistence

Metrics for the node status persistence system, which stores data plane node status in Kubernetes Lease objects.

Metric	Type	Description
`nantian_gateway_controlplane_node_status_persist_queue_depth`	Gauge	Current number of distinct node status updates waiting to enter the persistence worker
`nantian_gateway_controlplane_node_status_persist_pending_nodes`	Gauge	Number of distinct node status updates waiting in the debounce window
`nantian_gateway_controlplane_node_status_persist_enqueued_total`	Counter	Total node status updates accepted into the bounded backlog
`nantian_gateway_controlplane_node_status_persist_dropped_total`	Counter	Node status updates dropped because the bounded backlog was full
`nantian_gateway_controlplane_node_status_persist_immediate_total`	Counter	Immediate node status updates accepted into the backlog
`nantian_gateway_controlplane_node_status_persist_debounced_total`	Counter	Debounced node status updates accepted into the backlog
`nantian_gateway_controlplane_node_status_persist_flush_duration_seconds`	Histogram	Duration of flushing debounced node status persistence batches
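
Dropped updates are the clearest signal that the bounded backlog is undersized or the persistence worker is falling behind. A minimal sketch of an alert and a flush latency rule follows; the names and the 10-minute window are assumptions, and the `_bucket` suffix assumes a classic histogram.

```yaml
groups:
  - name: nantian-node-status-examples   # hypothetical group name
    rules:
      # Fires when any node status update was dropped in the last 10 minutes.
      - alert: NantianNodeStatusUpdatesDropped   # hypothetical alert name
        expr: increase(nantian_gateway_controlplane_node_status_persist_dropped_total[10m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Node status updates were dropped because the backlog was full"
      # 95th percentile flush duration for debounced batches.
      - record: nantian_gateway_controlplane:node_status_persist_flush_seconds:p95_5m   # hypothetical rule name
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(nantian_gateway_controlplane_node_status_persist_flush_duration_seconds_bucket[5m]))
          )
```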

Reconciler runner

Metrics for the custom reconciler runner that schedules infrastructure and status reconciliation.

Metric	Type	Labels	Description
`nantian_gateway_controlplane_reconciler_runner_runs_total`	Counter		Total custom reconciler runner executions
`nantian_gateway_controlplane_reconciler_runner_failures_total`	Counter		Failed reconciler runner executions
`nantian_gateway_controlplane_reconciler_runner_last_run_success`	Gauge		1 if the last reconciler execution succeeded, 0 otherwise
`nantian_gateway_controlplane_reconciler_runner_duration_seconds`	Histogram	`scope`	Duration of reconciler executions partitioned by scope
`nantian_gateway_controlplane_reconciler_runner_queue_depth`	Gauge		Current queue depth for reconciler triggers
`nantian_gateway_controlplane_reconciler_runner_triggers_enqueued_total`	Counter		Triggers accepted into the queue
`nantian_gateway_controlplane_reconciler_runner_triggers_deduplicated_total`	Counter		Triggers deduplicated (not enqueued) because the queue was already full
`nantian_gateway_controlplane_reconciler_runner_triggers_settled_total`	Counter		Triggers routed through the settle window
`nantian_gateway_controlplane_reconciler_runner_settle_pending`	Gauge		1 if a delayed settle trigger is pending, 0 otherwise
`nantian_gateway_controlplane_reconciler_runner_retries_scheduled_total`	Counter		Failure-triggered retry runs scheduled
`nantian_gateway_controlplane_reconciler_runner_retry_pending`	Gauge		1 if a failure-triggered retry is pending, 0 otherwise

The scope label on nantian_gateway_controlplane_reconciler_runner_duration_seconds takes one of the values infra, status, gateway_status, route_status, or policy_status.
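
To show how the success gauge and the scope label might be used together, the sketch below alerts on a reconciler that keeps failing and records a per-scope p99 duration. The names and the 10-minute hold time are assumptions, and the `_bucket` suffix assumes a classic histogram.

```yaml
groups:
  - name: nantian-reconciler-examples   # hypothetical group name
    rules:
      # Fires when the most recent reconciler run has been failing for 10 minutes.
      - alert: NantianReconcilerFailing   # hypothetical alert name
        expr: nantian_gateway_controlplane_reconciler_runner_last_run_success == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "The last reconciler runner execution did not succeed"
      # 99th percentile reconciler duration, kept separate per scope.
      - record: nantian_gateway_controlplane:reconciler_runner_duration_seconds:p99_5m   # hypothetical rule name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, scope) (rate(nantian_gateway_controlplane_reconciler_runner_duration_seconds_bucket[5m]))
          )
```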

Data plane metrics

Runtime

Metric	Type	Labels	Description
`nantian_gateway_dataplane_ready`	Gauge	`namespace`, `pod`, `job`	1 if the data plane is ready to serve traffic
`nantian_gateway_dataplane_runtime_supervisor_http_states`	Gauge	`namespace`, `pod`, `job`	HTTP supervisor state (1 = active)

Request processing

Metric	Type	Labels	Description
`nantian_gateway_dataplane_http_requests_total`	Counter	`listener`, `route`, `status_class`	HTTP requests processed by listener and route
`nantian_gateway_dataplane_http_request_duration_seconds`	Histogram	`listener`, `route`	Request duration from arrival to response
`nantian_gateway_dataplane_http_responses_total`	Counter	`listener`, `route`, `status_code`	HTTP responses sent
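
The request metrics cover the usual rate, error, and duration views. The rules below are a sketch: the group and rule names are hypothetical, the status_class value "5xx" is assumed to match the control plane convention, and the `_bucket` suffix assumes a classic histogram.

```yaml
groups:
  - name: nantian-dataplane-http-examples   # hypothetical group name
    rules:
      # Requests per second for each listener and route.
      - record: nantian_gateway_dataplane:http_requests:rate5m   # hypothetical rule name
        expr: sum by (listener, route) (rate(nantian_gateway_dataplane_http_requests_total[5m]))
      # Share of requests answered with a 5xx status class.
      - record: nantian_gateway_dataplane:http_5xx_ratio:rate5m   # hypothetical rule name
        expr: |
          sum by (listener, route) (rate(nantian_gateway_dataplane_http_requests_total{status_class="5xx"}[5m]))
          /
          sum by (listener, route) (rate(nantian_gateway_dataplane_http_requests_total[5m]))
      # 99th percentile request duration per listener and route.
      - record: nantian_gateway_dataplane:http_request_duration_seconds:p99_5m   # hypothetical rule name
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, listener, route) (rate(nantian_gateway_dataplane_http_request_duration_seconds_bucket[5m]))
          )
```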

Backend connections

Metric	Type	Labels	Description
`nantian_gateway_dataplane_backend_connections_active`	Gauge	`backend`	Active connections to backend services
`nantian_gateway_dataplane_backend_connections_total`	Counter	`backend`	Total connections established to backends
`nantian_gateway_dataplane_backend_connection_errors_total`	Counter	`backend`, `reason`	Connection errors by backend and reason
`nantian_gateway_dataplane_backend_health_status`	Gauge	`backend`, `endpoint`	1 if backend endpoint is healthy, 0 otherwise
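
Backend health and connection errors can be watched with rules along these lines. This is a sketch under assumed names and thresholds, not a shipped rule set.

```yaml
groups:
  - name: nantian-dataplane-backend-examples   # hypothetical group name
    rules:
      # Fires when a backend endpoint has reported unhealthy for 5 minutes.
      - alert: NantianBackendEndpointUnhealthy   # hypothetical alert name
        expr: nantian_gateway_dataplane_backend_health_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoint {{ $labels.endpoint }} of backend {{ $labels.backend }} is unhealthy"
      # Connection errors per second, split by backend and reason.
      - record: nantian_gateway_dataplane:backend_connection_errors:rate5m   # hypothetical rule name
        expr: sum by (backend, reason) (rate(nantian_gateway_dataplane_backend_connection_errors_total[5m]))
```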

Admin API

Metric	Type	Labels	Description
`nantian_gateway_dataplane_admin_requests_total`	Counter	`method`, `route`, `status_class`	Data plane admin API requests

Scraping configuration

Control plane scrape config

The control plane metrics endpoint is at :18082/metrics. Add this to your Prometheus scrape configuration:

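scrape_configs:
  - job_name: nantian-controlplane
    kubernetes_sd_configs:
      # Discover pods in the nantian-gw namespace only.
      - role: pod
        namespaces:
          names:
            - nantian-gw
    relabel_configs:
      # Keep only pods labeled app=nantian-controlplane.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: nantian-controlplane
        action: keep
      # Keep only the target for the metrics port (18082 by default).
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "18082"
        action: keep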

Data plane scrape config

Data plane metrics are served through the data plane admin API. If you use data plane aggregation, the control plane can proxy metrics requests. Otherwise, scrape each data plane pod directly:

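scrape_configs:
  - job_name: nantian-dataplane
    kubernetes_sd_configs:
      # Discover pods in the nantian-gw namespace only.
      - role: pod
        namespaces:
          names:
            - nantian-gw
    relabel_configs:
      # Keep only pods labeled app=nantian-dataplane.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: nantian-dataplane
        action: keep
      # No port filter is applied here, so every declared container port on a
      # matching pod becomes a scrape target.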

ServiceMonitors and PodMonitors for the Prometheus Operator are available in deploy/observability/prometheus/operator/.

Recording rules

The project ships recording rules in deploy/observability/prometheus/native/prometheus-dataplane-rules.yaml and the equivalent Prometheus Operator format in deploy/observability/prometheus/operator/prometheusrule-dataplane.yaml. These rules precompute frequently used aggregations:

Rule	Description
`nantian_gateway_dataplane_ready_replicas`	Sum of ready data plane instances
`nantian_gateway_dataplane_targets`	Count of data plane instances
`nantian_gateway_dataplane_not_ready_replicas`	Count of not-ready instances
`nantian_gateway_dataplane_container_cpu_cores`	CPU usage per container (requires cAdvisor)
`nantian_gateway_dataplane_container_cpu_request_cores`	CPU request per container (requires kube-state-metrics)
`nantian_gateway_dataplane_container_cpu_throttle_ratio`	CPU throttle ratio (requires cAdvisor)
`nantian_gateway_dataplane_container_memory_working_set_bytes`	Memory working set (requires cAdvisor)
`nantian_gateway_dataplane_container_memory_limit_bytes`	Memory limit (requires kube-state-metrics)
`nantian_gateway_dataplane_container_memory_request_bytes`	Memory request (requires kube-state-metrics)
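
Once the shipped rules are loaded, the recorded series can be used like any other metric. The alert below is a sketch that builds on nantian_gateway_dataplane_not_ready_replicas; the alert name, hold time, and severity are assumptions.

```yaml
groups:
  - name: nantian-dataplane-availability-examples   # hypothetical group name
    rules:
      # Fires when at least one data plane instance has been not ready for 5 minutes.
      - alert: NantianDataplaneNotReady   # hypothetical alert name
        expr: nantian_gateway_dataplane_not_ready_replicas > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "One or more data plane instances are not ready"
```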

What’s next

Grafana Dashboard — import and customize the pre-built dashboard
Alerting Rules — alerting rules you should configure
Configuration: Observability — logging, metrics, and tracing configuration