Observability

Nantian Gateway emits structured logs, Prometheus metrics, and optional OpenTelemetry traces from both the control plane and data plane. This page covers the configuration of each observability signal.

Logging

Both components use structured logging with JSON as the default format. Structured logs integrate directly with log aggregation systems such as Loki, Elasticsearch, and Datadog.

Control Plane Logging (Go)

The control plane uses Go’s `log/slog` package and accepts the following parameters:

Parameter	Type	Default	Description
`log.level`	string	`info`	Minimum log level: `debug`, `info`, `warn`, or `error`
`log.format`	string	`json`	Output format: `json` or `text`
`log.addSource`	bool	`false`	Include source file and line number in log entries

The `debug` level adds reconciliation details, configuration snapshot diffs, and gRPC stream events. It generates substantially more output than `info` and should not be left enabled in production.

When `log.addSource` is enabled, each log entry includes the Go source file and line number of the call site. This adds a small overhead per log call but makes it much easier to trace a message back to the code that emitted it.

Each JSON log entry contains the fields `time`, `level`, and `msg`, plus context-specific attributes. Use the `text` format for local development, where logs are read directly rather than processed by aggregation tools.
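
As an illustration of the parameters above, a local-development override might look like the following. This is only a sketch: it assumes the dotted parameter names map onto nested YAML keys in the control plane's configuration, and the surrounding file layout is not specified on this page.

```yaml
# Control plane logging for local debugging (assumed nested-key layout).
log:
  level: debug      # adds reconciliation details, snapshot diffs, gRPC stream events
  format: text      # human-readable output instead of JSON
  addSource: true   # include Go source file and line number in each entry
```

For production, returning to the defaults (`info`, `json`, `false`) keeps log volume and per-call overhead down.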

Data Plane Logging (Rust)

The data plane uses Rust’s `tracing` framework, which allows fine-grained, per-module control:

Parameter	Type	Default	Description
`log.level`	string	`info,nantian_core::connectors=off`	Tracing filter directives
`log.format`	string	`json`	Output format: `json` or `text`
`log.addSource`	bool	`false`	Include source location
`log.includeTarget`	bool	`false`	Include the tracing target (Rust module path)
`log.includeThreadIds`	bool	`false`	Include thread IDs
`log.includeThreadNames`	bool	`false`	Include thread names
`log.nonBlocking`	bool	`true`	Use non-blocking log output
`log.nonBlockingBufferedLines`	int	`65536`	Ring buffer capacity, in log lines
`log.dropWhenFull`	bool	`true`	Drop logs when the buffer is full

The `log.level` parameter accepts `tracing` filter directives for per-module granularity. For example, `info,hyper=warn,tower=debug` sets the global level to `info`, restricts `hyper` to `warn` and above, and raises `tower` to `debug`. The default value disables all output from `nantian_core::connectors`.
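
A directive string is written as a single value. The fragment below sketches the example from the paragraph above, assuming the dotted parameter names map onto nested YAML keys in the data plane's configuration. It also assumes that a configured value replaces the default string rather than merging with it, which is why the connector directive is repeated explicitly.

```yaml
# Data plane log filter (assumed nested-key layout).
log:
  # Global level info; hyper limited to warn and above; tower raised to debug;
  # nantian_core::connectors kept disabled, as in the default.
  level: "info,hyper=warn,tower=debug,nantian_core::connectors=off"
```

Quoting the value avoids any ambiguity in how a YAML parser reads the commas and `::` separators.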

Non-blocking logging is essential for a high-throughput proxy. With blocking output, a slow log sink (a congested disk or network) can stall proxy threads. The ring buffer absorbs bursts, and when `log.dropWhenFull` is enabled, log lines that arrive while the buffer is full are discarded so that the proxy keeps serving traffic.
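
For a deployment with bursty log volume, the buffer can be enlarged so that fewer lines are dropped. The following is a sketch under the same assumption that dotted parameter names map onto nested YAML keys; the buffer size shown is a hypothetical value, not a recommendation.

```yaml
# Data plane log buffering (assumed nested-key layout).
log:
  nonBlocking: true
  nonBlockingBufferedLines: 131072  # hypothetical: twice the default of 65536
  dropWhenFull: true                # discard lines instead of stalling proxy threads
```

A larger buffer trades memory for fewer dropped lines during bursts. Setting `dropWhenFull` to `false` would presumably make writers wait for free buffer space, which reintroduces the stall that non-blocking output is meant to avoid.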

Metrics

Both planes expose Prometheus metrics on configurable endpoints.

Control Plane Metrics

The control plane exposes an HTTP metrics endpoint at the configured path (default: /metrics). Metrics include:

Reconciliation counters — count of Kubernetes resource events processed, errors encountered, and configuration snapshots generated
gRPC stream metrics — active connections, messages sent and received, and connection duration
Go runtime metrics — goroutine count, memory allocation, and GC statistics

Data Plane Metrics

The data plane exposes metrics on a dedicated HTTP listener:

Parameter	Type	Default	Description
`metrics.enabled`	bool	`true`	Enable the metrics endpoint
`metrics.port`	int	`9100`	Port for the metrics HTTP server
`metrics.path`	string	`"/metrics"`	HTTP path for the metrics endpoint
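
The default port, 9100, is also the default port of the Prometheus node exporter, so the two can collide when they share a network namespace (for example, with host networking). A sketch of moving the listener, assuming the dotted parameter names map onto nested YAML keys and using a hypothetical port:

```yaml
# Data plane metrics listener (assumed nested-key layout).
metrics:
  enabled: true
  port: 9102          # hypothetical; any free port works
  path: "/metrics"
```

Any scrape configuration that filters on the port number, such as the one under Prometheus Scraping, needs the same value.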

Key data plane metrics:

Request counters — total requests, grouped by HTTP method, status code, and route
Latency histograms — request duration percentiles (p50, p95, p99)
Connection metrics — active connections, connection rate, and connection duration
Upstream metrics — backend connection pool hits/misses, backend latency, and backend errors
AI gateway metrics — token counts, provider latency, and rate limit enforcement

For a complete metric reference, see Metrics Reference.

Tracing

OpenTelemetry tracing is optional and can be enabled in the data plane configuration:

Parameter	Type	Default	Description
`tracing.enabled`	bool	`false`	Enable OpenTelemetry tracing
`tracing.endpoint`	string	`""`	OTLP collector endpoint
`tracing.sampleRate`	float	`0.1`	Sampling rate (0.0 to 1.0)
`tracing.serviceName`	string	`"nantian-gw"`	Service name in trace data

When enabled, the data plane propagates trace context headers and exports spans to the configured OTLP collector. Traces include the full request lifecycle: TLS handshake, route matching, header transformation, upstream request, and response.

`tracing.sampleRate` controls the fraction of requests that generate traces. A rate of 0.1, the default, traces 10% of requests. For production, start with a low sampling rate and increase it based on observability requirements and storage capacity.
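
A minimal tracing configuration might look like the sketch below, again assuming the dotted parameter names map onto nested YAML keys. The collector address is hypothetical. Port 4317 is the conventional OTLP/gRPC port, but this page does not state whether the data plane expects gRPC or HTTP, or a URL rather than a host:port pair.

```yaml
# Data plane tracing (assumed nested-key layout).
tracing:
  enabled: true
  endpoint: "http://otel-collector.observability.svc:4317"  # hypothetical collector address
  sampleRate: 0.01          # trace 1% of requests
  serviceName: "nantian-gw"
```

As a rough sizing aid, at a hypothetical 2,000 requests per second a rate of 0.01 yields about 20 traces per second, while the default of 0.1 yields about 200.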

Prometheus Scraping

The following Prometheus scrape configuration discovers pods in the `nantian-gw` namespace and keeps only data plane targets on the metrics port:

```yaml
scrape_configs:
  - job_name: nantian-gw-dataplane
    # Discover pods in the gateway namespace only.
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [nantian-gw]
    relabel_configs:
      # Keep only pods labeled as the data plane.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: nantian-gw-dataplane
      # Keep only the metrics port (must match metrics.port).
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "9100"
```
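
If `metrics.path` is changed from its default, Prometheus needs the matching `metrics_path`, because it otherwise scrapes `/metrics`. The variant below is a sketch: the path value is hypothetical, and the final relabel rule is an optional addition that copies the pod name onto each series so that replicas can be told apart.

```yaml
scrape_configs:
  - job_name: nantian-gw-dataplane
    metrics_path: /internal/metrics   # hypothetical; must equal metrics.path
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [nantian-gw]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: nantian-gw-dataplane
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "9100"
      # Default action is "replace": write the pod name into a "pod" label.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```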

Next Steps

Metrics Reference — complete reference of all available metrics
Grafana Dashboard — pre-built dashboards for visualization
Alerting Rules — recommended alerting configurations