
Grafana Dashboard

Nantian Gateway ships with a pre-built Grafana dashboard at deploy/observability/grafana/pingora-gateway-observability-dashboard.json. The dashboard provides a comprehensive view of both control plane and data plane health, performance, and resource utilization.

  1. Navigate to Dashboards > New > Import
  2. Upload pingora-gateway-observability-dashboard.json or paste its contents
  3. Select your Prometheus data source from the dropdown
  4. Click Import

If you deploy Grafana through the kube-prometheus-stack or a similar operator, wrap the dashboard JSON in a ConfigMap and add the grafana_dashboard label so that the Grafana dashboard sidecar discovers and loads it:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nantian-gateway-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  nantian-gateway.json: |
    {
      "uid": "nantian-gw-obsv",
      ...
    }

Apply it:

kubectl apply -f grafana-dashboard-configmap.yaml
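Pasting a large dashboard JSON into a manifest by hand is error-prone, so the ConfigMap could also be generated. The following is a minimal sketch under stated assumptions: the helper name build_dashboard_configmap is hypothetical, and it emits the manifest as JSON rather than YAML (kubectl accepts both), which avoids a YAML library dependency.

```python
import json


def build_dashboard_configmap(dashboard_json: str,
                              name: str = "nantian-gateway-dashboard",
                              namespace: str = "monitoring") -> dict:
    """Wrap a dashboard JSON string in a ConfigMap manifest (hypothetical helper)."""
    # Fail early if the dashboard text is not valid JSON.
    json.loads(dashboard_json)
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label that the Grafana dashboard sidecar is assumed to watch.
            "labels": {"grafana_dashboard": "1"},
        },
        # The dashboard is stored verbatim under a single key.
        "data": {"nantian-gateway.json": dashboard_json},
    }


def render_manifest(dashboard_json: str) -> str:
    """Return the ConfigMap manifest as pretty-printed JSON text."""
    return json.dumps(build_dashboard_configmap(dashboard_json), indent=2)
```

To use it, read pingora-gateway-observability-dashboard.json into a string, pass it to render_manifest, write the result to a file, and apply that file with kubectl apply -f.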

The Helm chart at deploy/helm/nantian-gw includes the dashboard ConfigMap. Enable it in your values:

grafana:
  dashboards:
    enabled: true

The dashboard is organized into these rows:

Top-level metrics in a single row for at-a-glance monitoring:

| Panel | Metric | What it shows |
| --- | --- | --- |
| Ready Pods | nantian_gateway_dataplane_ready | Number of data plane pods ready to serve traffic |
| QPS | nantian_gateway_dataplane_traffic_request_events_total | Requests per second across all data planes |
| Success Rate | Response flags ratio | Percentage of requests with no error flags |
| P99 Latency | nantian_gateway_dataplane_traffic_request_latency_ms | 99th percentile request latency |
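To spot-check the overview numbers outside Grafana, the same metrics can be queried through the Prometheus HTTP API. The sketch below only builds instant-query URLs. The PromQL expressions are assumptions, not the dashboard's actual panel queries: they assume nantian_gateway_dataplane_ready is a 0/1 gauge per pod and that the request events metric is a counter.

```python
from urllib.parse import urlencode

# Hypothetical expressions; the shipped dashboard may aggregate differently.
OVERVIEW_QUERIES = {
    "ready_pods": "sum(nantian_gateway_dataplane_ready)",
    "qps": "sum(rate(nantian_gateway_dataplane_traffic_request_events_total[1m]))",
}


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus /api/v1/query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def overview_urls(base_url: str) -> dict:
    """Map each overview panel name to its instant-query URL."""
    return {name: instant_query_url(base_url, expr)
            for name, expr in OVERVIEW_QUERIES.items()}
```

Each URL can be fetched with curl or urllib.request. If a query returns an empty result while the matching panel is also blank, the problem is more likely in scraping than in the dashboard.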

Panels monitoring the control plane’s internal operations:

| Panel | Key metrics |
| --- | --- |
| Snapshot builds | nantian_gateway_snapshot_builds_total, nantian_gateway_snapshot_build_failures_total |
| Build duration | nantian_gateway_snapshot_build_duration_seconds |
| Snapshot resource counts | nantian_gateway_snapshot_resource_count |
| xDS stream status | nantian_gateway_controlplane_xds_stream_terminations_total |
| xDS publish lag | nantian_gateway_controlplane_xds_publish_ack_lag_seconds |
| Admin API requests | nantian_gateway_controlplane_admin_requests_total |
| Reconciler runner | nantian_gateway_controlplane_reconciler_runner_runs_total, queue depth, settle state |

Panels covering proxy performance and backend health:

| Panel | Key metrics |
| --- | --- |
| HTTP request rate | nantian_gateway_dataplane_http_requests_total by listener and route |
| Request latency | nantian_gateway_dataplane_http_request_duration_seconds |
| Backend connections | nantian_gateway_dataplane_backend_connections_active |
| Backend health | nantian_gateway_dataplane_backend_health_status |
| Connection errors | nantian_gateway_dataplane_backend_connection_errors_total |
| Response status distribution | nantian_gateway_dataplane_http_responses_total by status code |

Container resource utilization panels (requires cAdvisor and kube-state-metrics):

| Panel | Recording rules used |
| --- | --- |
| CPU usage | nantian_gateway_dataplane_container_cpu_cores |
| CPU throttle | nantian_gateway_dataplane_container_cpu_throttle_ratio |
| Memory working set | nantian_gateway_dataplane_container_memory_working_set_bytes |
| Memory vs limits | nantian_gateway_dataplane_container_memory_limit_bytes |

The dashboard defines these template variables for filtering:

| Variable | Type | Source | Description |
| --- | --- | --- | --- |
| datasource | Data source | User selection | Prometheus data source |
| namespace | Query | nantian_gateway_dataplane_ready | Kubernetes namespace (default: nantian-gw) |
| pod_dp | Query | nantian_gateway_dataplane_ready | Data plane pod selector (multi-select, default: all) |
| job_controlplane | Query | nantian_gateway_controlplane_admin_requests_total | Control plane job selector |
| instance_cp | Query | nantian_gateway_controlplane_admin_requests_total | Control plane instance selector |
| job_dataplane | Query | nantian_gateway_dataplane_admin_requests_total | Data plane job selector |
| instance_dp | Query | nantian_gateway_dataplane_admin_requests_total | Data plane instance selector |

The dashboard is a standard Grafana JSON model. You can customize it through the Grafana UI or by editing the JSON directly.

  1. Click Add > Visualization in the desired row
  2. Select a visualization type (time series, stat, gauge, table, etc.)
  3. Write a PromQL query in the query editor
  4. Configure the panel title, legend, and thresholds
  5. Save the dashboard

The dashboard refreshes every 30 seconds by default. You can change this in the dashboard settings or by modifying the refresh field in the JSON:

{
  "refresh": "10s"
}
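If the dashboard JSON is kept in version control, the refresh interval can be changed with a small script instead of a manual edit. This is a minimal sketch: set_refresh is a hypothetical helper, and it assumes refresh is a top-level field of the dashboard model, as in the snippet above.

```python
import json


def set_refresh(dashboard_json: str, interval: str) -> str:
    """Return the dashboard JSON text with its top-level refresh field replaced."""
    dashboard = json.loads(dashboard_json)
    # Grafana interval strings look like "10s", "1m" or "5m".
    dashboard["refresh"] = interval
    return json.dumps(dashboard, indent=2)
```

Re-serializing with json.dumps may reformat the file's whitespace, so expect a larger diff than the single changed field on the first run.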

You can create Grafana alert rules directly from dashboard panels. Select a panel, click Alert > Create alert rule from this panel, and configure the threshold and notification channel.

Check that Prometheus is scraping both the control plane and data plane metrics endpoints:

# Check Prometheus targets
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job | startswith("nantian"))'
# Test control plane metrics directly
curl -s http://<control-plane-ip>:18082/metrics | head -20
# Test data plane metrics directly
curl -s http://<data-plane-ip>:<admin-port>/metrics | head -20
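When the target list is long, it can help to print only the gateway targets that are not healthy. The sketch below filters the parsed response of /api/v1/targets. It relies on the labels, health, scrapeUrl and lastError fields of the Prometheus targets API, and assumes, like the jq filter above, that gateway job names start with "nantian". The function name is hypothetical.

```python
import json


def unhealthy_gateway_targets(targets_response: dict,
                              job_prefix: str = "nantian") -> list:
    """List (job, scrape URL, last error) for gateway targets that are not up."""
    unhealthy = []
    active = targets_response.get("data", {}).get("activeTargets", [])
    for target in active:
        job = target.get("labels", {}).get("job", "")
        # Keep only gateway jobs whose last scrape did not succeed.
        if job.startswith(job_prefix) and target.get("health") != "up":
            unhealthy.append((job,
                              target.get("scrapeUrl", ""),
                              target.get("lastError", "")))
    return unhealthy


def parse_targets(raw: str) -> dict:
    """Parse the raw JSON body returned by the targets endpoint."""
    return json.loads(raw)
```

An empty result means every matching target is up. If no targets match the prefix at all, the result is also empty, so confirm first that the curl command above returns at least one gateway target.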

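The resource panels depend on cAdvisor and kube-state-metrics. Verify that these are running (cAdvisor is usually built into the kubelet, so on many clusters only kube-state-metrics appears as a separate pod):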

kubectl get pods -n monitoring | grep -E "cadvisor|kube-state-metrics"

The dashboard defaults to the nantian-gw namespace. If you deployed the gateway in a different namespace, update the namespace variable in the dashboard settings or change the default value in the JSON.
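Changing the default in the JSON can be scripted as well. The sketch below assumes the dashboard follows the usual Grafana JSON model, where template variables live under templating.list and each variable's default is stored in its current object. The helper name is hypothetical, and the shipped dashboard's exact layout has not been verified here.

```python
def set_namespace_default(dashboard: dict, namespace: str) -> bool:
    """Set the default of the 'namespace' template variable in place.

    Returns True if the variable was found and updated, False otherwise.
    """
    variables = dashboard.get("templating", {}).get("list", [])
    for variable in variables:
        if variable.get("name") == "namespace":
            # Grafana stores the selected default as text/value pairs.
            variable["current"] = {
                "selected": True,
                "text": namespace,
                "value": namespace,
            }
            return True
    return False
```

Load the dashboard with json.load, call the function with your namespace, and write the result back with json.dump. A False return value means no variable named namespace was found, so the file was left unchanged.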