Observability
Nantian Gateway emits structured logs, Prometheus metrics, and optional OpenTelemetry traces from both the control plane and data plane. This page covers the configuration of each observability signal.
Logging
Both components use structured logging with JSON as the default format. Structured logs integrate directly with log aggregation systems such as Loki, Elasticsearch, and Datadog.
Control Plane Logging (Go)
The control plane uses Go’s log/slog package:
| Parameter | Type | Default | Description |
|---|---|---|---|
| log.level | string | info | Minimum log level: debug, info, warn, or error |
| log.format | string | json | Output format: json or text |
| log.addSource | bool | false | Include source file and line number in log entries |
The debug level includes reconciliation details, configuration snapshot diffs, and gRPC stream events. This level generates substantially more output than info and should not remain enabled in production.
When log.addSource is enabled, each log entry includes the Go source file and line number. This adds a small overhead per log call but significantly aids debugging.
Each JSON log entry contains the fields time, level, msg, and context-specific attributes. Use the text format for local development where logs are read directly rather than processed by aggregation tools.
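As a rough illustration of consuming these entries downstream, the sketch below parses JSON log lines and keeps only those at or above a chosen level. Only the time, level, and msg fields come from the description above. The sample lines, the reason attribute, and the helper name filter_entries are invented for illustration, and the uppercase level strings assume log/slog's default JSON rendering.

```python
import json

# Severity order for the four levels listed in the table above.
LEVEL_ORDER = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3}


def filter_entries(lines, min_level="WARN"):
    """Yield parsed log entries whose level is at or above min_level."""
    threshold = LEVEL_ORDER[min_level.upper()]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not JSON, e.g. output written in text format.
            continue
        if not isinstance(entry, dict):
            continue
        level = str(entry.get("level", "")).upper()
        # Entries with an unknown or missing level are skipped.
        if LEVEL_ORDER.get(level, -1) >= threshold:
            yield entry


# Hypothetical sample output; real entries carry other attributes.
SAMPLE = [
    '{"time":"2024-01-01T00:00:00Z","level":"INFO","msg":"snapshot generated"}',
    '{"time":"2024-01-01T00:00:01Z","level":"ERROR","msg":"reconcile failed","reason":"timeout"}',
    "not json",
]

for entry in filter_entries(SAMPLE):
    print(entry["time"], entry["level"], entry["msg"])
```

In practice the aggregation system usually performs this filtering; the sketch only shows that the fixed fields make each entry straightforward to process.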
Data Plane Logging (Rust)
The data plane uses Rust’s tracing framework with fine-grained per-module control:
| Parameter | Type | Default | Description |
|---|---|---|---|
| log.level | string | info,nantian_core::connectors=off | Tracing filter directives |
| log.format | string | json | Output format: json or text |
| log.addSource | bool | false | Include source location |
| log.includeTarget | bool | false | Include the tracing target (Rust module path) |
| log.includeThreadIds | bool | false | Include thread IDs |
| log.includeThreadNames | bool | false | Include thread names |
| log.nonBlocking | bool | true | Use non-blocking log output |
| log.nonBlockingBufferedLines | int | 65536 | Ring buffer capacity |
| log.dropWhenFull | bool | true | Drop logs when the buffer is full |
The log.level field accepts tracing filter directives for per-module granularity. For example, info,hyper=warn,tower=debug sets the global level to info, restricts hyper to warn and above, and raises tower verbosity to debug. The default value disables all log output from nantian_core::connectors.
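To make the directive semantics concrete, here is a simplified model of how such a filter string resolves to a level for a given module path. It handles only the level and target=level forms and picks the longest matching module prefix. The function names and the nantian_core::proxy module are hypothetical, treating unmatched targets as off when no global level is given is a choice made for this sketch, and the real tracing filter supports more syntax than this.

```python
# Verbosity order used by the sketch; higher means more output.
LEVELS = {"off": 0, "error": 1, "warn": 2, "info": 3, "debug": 4, "trace": 5}


def parse_directives(spec):
    """Split a filter string into (default_level, {target: level})."""
    default, targets = "off", {}  # sketch assumption: no global level means off
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        target, sep, level = part.partition("=")
        if not sep:
            # A bare level sets the global default.
            target, level = None, part
        level = level.strip().lower()
        if level not in LEVELS:
            raise ValueError(f"unsupported directive in this sketch: {part!r}")
        if target is None:
            default = level
        else:
            targets[target.strip()] = level
    return default, targets


def effective_level(spec, target):
    """Return the level that applies to a module path such as 'hyper::client'."""
    default, targets = parse_directives(spec)
    best = None
    for name in targets:
        if target == name or target.startswith(name + "::"):
            # The most specific (longest) matching prefix wins.
            if best is None or len(name) > len(best):
                best = name
    return targets[best] if best is not None else default


def is_enabled(spec, target, event_level):
    """True if an event at event_level from target passes the filter."""
    return LEVELS[event_level] <= LEVELS[effective_level(spec, target)]


SPEC = "info,hyper=warn,tower=debug"
print(effective_level(SPEC, "hyper::client"))        # warn
print(effective_level(SPEC, "tower::buffer"))        # debug
print(effective_level(SPEC, "nantian_core::proxy"))  # info
```

With the default value info,nantian_core::connectors=off, the same model reports that even error events from nantian_core::connectors are filtered out.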
Non-blocking logging is essential for a high-throughput proxy. With blocking output, a slow log sink (a congested disk or network) can stall proxy threads. The ring buffer absorbs bursts, and when it fills, dropWhenFull discards log lines instead of blocking, so the proxy keeps serving traffic at the cost of losing some log output.
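The trade-off can be sketched with a small single-threaded simulation of a bounded buffer. The class and method names are hypothetical, and the sketch assumes the incoming line is the one discarded when the buffer is full; the data plane's actual buffer implementation is not described here.

```python
from collections import deque


class BoundedLogBuffer:
    """Fixed-capacity line buffer that never blocks the producer."""

    def __init__(self, capacity, drop_when_full=True):
        self.capacity = capacity
        self.drop_when_full = drop_when_full
        self.lines = deque()
        self.dropped = 0

    def push(self, line):
        """Called on the request path; returns False if the line was dropped."""
        if len(self.lines) >= self.capacity:
            if self.drop_when_full:
                self.dropped += 1
                return False
            # Stand-in for blocking: a real writer would wait here.
            raise BufferError("buffer full; a blocking writer would stall here")
        self.lines.append(line)
        return True

    def drain(self, max_lines):
        """Called by the background writer; removes up to max_lines lines."""
        count = min(max_lines, len(self.lines))
        return [self.lines.popleft() for _ in range(count)]


buf = BoundedLogBuffer(capacity=4)
accepted = [buf.push(f"request {i}") for i in range(6)]  # burst of 6 lines
print(accepted.count(True), "accepted,", buf.dropped, "dropped")

written = buf.drain(2)         # a slow sink manages to write only 2 lines
print(buf.push("request 6"))   # space is available again
```

A larger log.nonBlockingBufferedLines value lets the buffer ride out longer bursts before any lines are dropped, in exchange for more memory.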
Metrics
Both planes expose Prometheus metrics on configurable endpoints.
Control Plane Metrics
The control plane exposes an HTTP metrics endpoint at the configured path (default: /metrics). Metrics include:
- Reconciliation counters — count of Kubernetes resource events processed, errors encountered, and configuration snapshots generated
- gRPC stream metrics — active connections, messages sent and received, and connection duration
- Go runtime metrics — goroutine count, memory allocation, and GC statistics
Data Plane Metrics
The data plane exposes metrics on a dedicated HTTP listener:
| Parameter | Type | Default | Description |
|---|---|---|---|
| metrics.enabled | bool | true | Enable the metrics endpoint |
| metrics.port | int | 9100 | Port for the metrics HTTP server |
| metrics.path | string | "/metrics" | HTTP path for the metrics endpoint |
Key data plane metrics:
- Request counters — total requests, grouped by HTTP method, status code, and route
- Latency histograms — request duration percentiles (p50, p95, p99)
- Connection metrics — active connections, connection rate, and connection duration
- Upstream metrics — backend connection pool hits/misses, backend latency, and backend errors
- AI gateway metrics — token counts, provider latency, and rate limit enforcement
For a complete metric reference, see Metrics Reference.
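Prometheus histograms store cumulative bucket counts, and percentiles such as p95 are typically estimated from those buckets at query time. The sketch below shows the usual linear-interpolation estimate. The bucket boundaries and counts are invented, and the function name is hypothetical; consult the metric reference for the buckets the data plane actually exports.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0.0 and 1.0")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request-duration buckets in seconds (cumulative counts).
BUCKETS = [(0.05, 600), (0.1, 900), (0.25, 980), (0.5, 1000), (math.inf, 1000)]

for q in (0.5, 0.95, 0.99):
    print(f"p{round(q * 100)} is about {estimate_quantile(q, BUCKETS):.4f}s")
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.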
Tracing
OpenTelemetry tracing is optional and can be enabled in the data plane configuration:
| Parameter | Type | Default | Description |
|---|---|---|---|
| tracing.enabled | bool | false | Enable OpenTelemetry tracing |
| tracing.endpoint | string | "" | OTLP collector endpoint |
| tracing.sampleRate | float | 0.1 | Sampling rate (0.0 to 1.0) |
| tracing.serviceName | string | "nantian-gw" | Service name in trace data |
When enabled, the data plane propagates trace context headers and exports spans to the configured OTLP collector. Traces include the full request lifecycle: TLS handshake, route matching, header transformation, upstream request, and response.
The sampleRate controls the fraction of requests that generate traces. A rate of 0.1 traces 10% of requests. For production, start with a low sampling rate and increase based on observability requirements and storage capacity.
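To illustrate what a ratio-based sampling decision looks like, the sketch below derives the decision from the trace ID, in the spirit of OpenTelemetry's trace-ID ratio samplers. Whether the data plane uses this exact strategy is an assumption, and both function names are hypothetical. A second helper estimates trace volume for capacity planning.

```python
import random

LOW_64_BITS = (1 << 64) - 1


def should_sample(trace_id, sample_rate):
    """Decide from the trace ID alone, so the same ID always gets the same answer."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # Compare the low 64 bits of the 128-bit trace ID against a threshold.
    threshold = int(sample_rate * (1 << 64))
    return (trace_id & LOW_64_BITS) < threshold


def estimated_traces_per_second(requests_per_second, sample_rate):
    """Expected number of traces exported per second at a given request rate."""
    return requests_per_second * sample_rate


rng = random.Random(42)  # fixed seed so the simulation is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
sampled = sum(should_sample(t, 0.1) for t in trace_ids)

print(f"sampled {sampled / len(trace_ids):.1%} of simulated requests")
print(f"about {estimated_traces_per_second(2_000, 0.1):.0f} traces/s at 2,000 requests/s")
```

At the default rate of 0.1, a gateway handling 2,000 requests per second would export roughly 200 traces per second, which is a useful starting point for sizing the collector and trace storage.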
Prometheus Scraping
To configure Prometheus to scrape the data plane metrics, add a scrape job like the following. The port regex must match metrics.port (9100 by default):
```yaml
scrape_configs:
  - job_name: nantian-gw-dataplane
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [nantian-gw]
    relabel_configs:
      # Keep only pods labelled as the data plane.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: nantian-gw-dataplane
      # Keep only the container port that serves metrics.
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "9100"
```
Next Steps
- Metrics Reference: complete reference of all available metrics
- Grafana Dashboard: pre-built dashboards for visualization
- Alerting Rules: recommended alerting configurations