Observability

Nantian Gateway emits structured logs, Prometheus metrics, and optional OpenTelemetry traces from both the control plane and data plane. This page covers the configuration of each observability signal.

Both components use structured logging with JSON as the default format. Structured logs integrate directly with log aggregation systems such as Loki, Elasticsearch, and Datadog.

The control plane uses Go’s log/slog package:

| Parameter | Type | Default | Description |
|---|---|---|---|
| log.level | string | info | Minimum log level: debug, info, warn, or error |
| log.format | string | json | Output format: json or text |
| log.addSource | bool | false | Include source file and line number in log entries |

The debug level includes reconciliation details, configuration snapshot diffs, and gRPC stream events. This level generates substantially more output than info and should not remain enabled in production.

When log.addSource is enabled, each log entry includes the Go source file and line number. This adds a small overhead per log call but significantly aids debugging.

Each JSON log entry contains the fields time, level, msg, and context-specific attributes. Use the text format for local development where logs are read directly rather than processed by aggregation tools.
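As a rough illustration of consuming these entries, the sketch below filters JSON log lines by severity using only the documented time, level, and msg fields. The sample lines, the resource attribute, and the filter_entries helper are hypothetical and not part of Nantian Gateway; real entries carry their own context-specific attributes.

```python
import json

# Severity order matching the documented log.level values.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_entries(lines, min_level="warn"):
    """Return parsed entries at or above min_level, skipping non-JSON lines."""
    threshold = LEVELS[min_level]
    kept = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. text-format output mixed into the stream
        if not isinstance(entry, dict):
            continue
        # Compare case-insensitively, since loggers differ in level casing.
        level = str(entry.get("level", "")).lower()
        if LEVELS.get(level, -1) >= threshold:
            kept.append(entry)
    return kept


# Hypothetical sample lines for illustration only.
SAMPLE = [
    '{"time":"2024-01-01T00:00:00Z","level":"INFO","msg":"snapshot generated"}',
    '{"time":"2024-01-01T00:00:01Z","level":"ERROR","msg":"reconcile failed","resource":"demo"}',
    "not json",
]

if __name__ == "__main__":
    for entry in filter_entries(SAMPLE):
        print(entry["time"], entry["level"], entry["msg"])
```

In practice an aggregation system such as Loki or Elasticsearch would do this filtering; the sketch only shows why a stable JSON schema makes it straightforward.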

The data plane uses Rust’s tracing framework with fine-grained per-module control:

| Parameter | Type | Default | Description |
|---|---|---|---|
| log.level | string | info,nantian_core::connectors=off | Tracing filter directives |
| log.format | string | json | Output format: json or text |
| log.addSource | bool | false | Include source location |
| log.includeTarget | bool | false | Include the tracing target (Rust module path) |
| log.includeThreadIds | bool | false | Include thread IDs |
| log.includeThreadNames | bool | false | Include thread names |
| log.nonBlocking | bool | true | Use non-blocking log output |
| log.nonBlockingBufferedLines | int | 65536 | Ring buffer capacity |
| log.dropWhenFull | bool | true | Drop logs when the buffer is full |

The log.level field accepts tracing filter directives for per-module granularity. For example, info,hyper=warn,tower=debug sets the global level to info, restricts hyper to warn and above, and raises tower to debug. The default value, info,nantian_core::connectors=off, logs at info globally and turns off all output from nantian_core::connectors.
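To make the directive semantics concrete, here is a simplified Python model of how a filter string resolves to a level for a given module. It is a sketch, not the data plane's Rust implementation: it treats every bare directive as a global level, assumes the most specific (longest) matching target wins, and falls back to error when no global level is given. The module name nantian_core::proxy is hypothetical.

```python
# Levels ordered from least to most verbose.
LEVELS = ["off", "error", "warn", "info", "debug", "trace"]


def parse_directives(spec):
    """Split 'info,hyper=warn' into (global_level, {target: level})."""
    default, targets = "error", {}  # fallback level is an assumption
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            target, level = part.split("=", 1)
            targets[target.strip()] = level.strip().lower()
        else:
            default = part.lower()
    return default, targets


def effective_level(spec, module):
    """Return the level of the longest target matching the module path."""
    default, targets = parse_directives(spec)
    best, best_len = default, -1
    for target, level in targets.items():
        matches = module == target or module.startswith(target + "::")
        if matches and len(target) > best_len:
            best, best_len = level, len(target)
    return best


def enabled(spec, module, level):
    """True if an event at `level` from `module` passes the filter."""
    return LEVELS.index(level) <= LEVELS.index(effective_level(spec, module))


if __name__ == "__main__":
    spec = "info,hyper=warn,tower=debug"
    for module in ("hyper::client", "tower", "nantian_core::proxy"):
        print(module, "->", effective_level(spec, module))
```

Under this model, an info event from hyper::client is filtered out by the example string, while a debug event from tower passes.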

Non-blocking logging is essential for a high-throughput proxy. With blocking output, a slow log sink (a congested disk or network) can stall proxy threads. The ring buffer, sized by log.nonBlockingBufferedLines, absorbs bursts. When the buffer is full, log.dropWhenFull discards log lines instead of blocking, so the proxy keeps serving traffic at the cost of losing some log output.
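The trade-off can be sketched with a bounded queue drained by a background thread. This is an illustrative Python model rather than the data plane's actual Rust code; the NonBlockingLogWriter class and its dropped counter are hypothetical names, and only the capacity and drop_when_full arguments mirror the documented parameters.

```python
import queue
import threading


class NonBlockingLogWriter:
    """Bounded buffer between request threads and a possibly slow sink."""

    def __init__(self, sink, capacity=65536, drop_when_full=True):
        self._queue = queue.Queue(maxsize=capacity)
        self._sink = sink
        self._drop_when_full = drop_when_full
        self.dropped = 0  # lines discarded because the buffer was full
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, line):
        """Enqueue a line; never blocks when drop_when_full is set."""
        if self._drop_when_full:
            try:
                self._queue.put_nowait(line)
            except queue.Full:
                self.dropped += 1
        else:
            self._queue.put(line)  # blocks the caller until space frees up

    def _drain(self):
        # Background thread: forward buffered lines to the sink in order.
        while True:
            line = self._queue.get()
            if line is None:  # sentinel pushed by close()
                break
            self._sink(line)

    def close(self):
        """Flush remaining lines and stop the background thread."""
        self._queue.put(None)
        self._worker.join()
```

With drop_when_full disabled, write falls back to a blocking put, which is exactly the stall the default configuration avoids. Tracking a dropped counter is one way to make lost log lines visible.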

Both planes expose Prometheus metrics on configurable endpoints.

The control plane exposes an HTTP metrics endpoint at the configured path (default: /metrics). Metrics include:

  • Reconciliation counters — count of Kubernetes resource events processed, errors encountered, and configuration snapshots generated
  • gRPC stream metrics — active connections, messages sent and received, and connection duration
  • Go runtime metrics — goroutine count, memory allocation, and GC statistics

The data plane exposes metrics on a dedicated HTTP listener:

| Parameter | Type | Default | Description |
|---|---|---|---|
| metrics.enabled | bool | true | Enable the metrics endpoint |
| metrics.port | int | 9100 | Port for the metrics HTTP server |
| metrics.path | string | "/metrics" | HTTP path for the metrics endpoint |

Key data plane metrics:

  • Request counters — total requests, grouped by HTTP method, status code, and route
  • Latency histograms — request duration percentiles (p50, p95, p99)
  • Connection metrics — active connections, connection rate, and connection duration
  • Upstream metrics — backend connection pool hits/misses, backend latency, and backend errors
  • AI gateway metrics — token counts, provider latency, and rate limit enforcement
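If the latency metrics are exported as Prometheus histograms (an assumption; this page does not state the metric type), percentiles such as p95 are estimated from cumulative bucket counts at query time. The sketch below shows that estimation with linear interpolation inside the matching bucket, similar in spirit to PromQL's histogram_quantile. The bucket boundaries and counts are made-up illustration values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with the last bound being +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or empty bucket): report lower edge.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(f"p{int(q * 100)}: {histogram_quantile(q, BUCKETS):.3f}s")
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.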

For a complete metric reference, see Metrics Reference.

OpenTelemetry tracing is optional and can be enabled in the data plane configuration:

| Parameter | Type | Default | Description |
|---|---|---|---|
| tracing.enabled | bool | false | Enable OpenTelemetry tracing |
| tracing.endpoint | string | "" | OTLP collector endpoint |
| tracing.sampleRate | float | 0.1 | Sampling rate (0.0 to 1.0) |
| tracing.serviceName | string | "nantian-gw" | Service name in trace data |

When enabled, the data plane propagates trace context headers and exports spans to the configured OTLP collector. Traces include the full request lifecycle: TLS handshake, route matching, header transformation, upstream request, and response.
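This page does not say which propagation format is used. Assuming the W3C Trace Context traceparent header, which is the OpenTelemetry default, propagation amounts to parsing the incoming header and forwarding the same trace ID with a new span ID. The sketch below illustrates that idea; the function names are hypothetical and the example header value is the one used in the W3C specification.

```python
import re
import secrets

# version-traceid-parentid-flags, all lowercase hex (version 00 layout).
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec reserves version ff and forbids all-zero IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for the upstream request: same trace, new span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return None
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(header))
    print(child_traceparent(header))
```

Keeping the trace ID unchanged is what lets the collector stitch the gateway's spans together with spans from the client and the upstream service.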

The tracing.sampleRate parameter controls the fraction of requests that generate traces. A rate of 0.1 traces 10% of requests. For production, start with a low sampling rate and increase it based on observability requirements and trace storage capacity.
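One common way to apply a sampling rate is to derive the decision from the trace ID, so every service that sees the same trace makes the same choice. The sketch below follows the idea behind OpenTelemetry's ratio-based samplers; whether the data plane uses this exact scheme is an assumption, and should_sample and sampled_fraction are hypothetical helper names.

```python
import secrets


def should_sample(trace_id: int, sample_rate: float) -> bool:
    """Decide deterministically from the low 64 bits of a 128-bit trace ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    bound = round(sample_rate * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


def sampled_fraction(sample_rate: float, n: int = 100_000) -> float:
    """Measure the sampled share over n random trace IDs."""
    hits = sum(should_sample(secrets.randbits(128), sample_rate) for _ in range(n))
    return hits / n


if __name__ == "__main__":
    print(f"observed fraction at 0.1: {sampled_fraction(0.1):.3f}")
```

At the default rate of 0.1 the observed fraction settles near 10%, which also gives a rough estimate of trace storage: about one tenth of request volume multiplied by the average number of spans per trace.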

To configure Prometheus to scrape the data plane metrics:

scrape_configs:
  - job_name: nantian-gw-dataplane
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [nantian-gw]
    relabel_configs:
      # Keep only pods labeled app.kubernetes.io/name=nantian-gw-dataplane.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: nantian-gw-dataplane
      # Keep only the metrics port (metrics.port, default 9100).
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "9100"