Control Plane Design

The control plane is the brain of Nantian Gateway. It watches Kubernetes resources, translates them into configuration the data plane can understand, manages the lifecycle of data plane connections, and writes status back to Gateway API resources. It runs as a Go binary built on controller-runtime and grpc-go.

Startup sequence

The control plane boots in a carefully ordered sequence to avoid serving partial configuration. The entry point is gateway/cmd/manager/main.go, which calls run() in app.go:

Load configuration from the YAML file specified by --config
Build the Kubernetes scheme registering all Gateway API types, custom CRDs, and infrastructure types
Create the controller-runtime manager with leader election, health probes, and a metrics server
Initialize core services: metrics registry, IR snapshot store, and node status registry backed by Kubernetes Leases
Create the translator with resource limits that protect against runaway translation
Create the status writer with the controller name and advertised addresses
Set up the reconciler runner with scoped reconcilers for infrastructure and status
Create the snapshot syncer that watches Kubernetes resources, feeds them to the translator, and publishes snapshots
Create the admin API server with optional TLS and bearer token authentication
Create the gRPC xDS server with TLS/mTLS and runtime configuration
Assemble all components under the lifecycle supervisor
Run the supervisor: all components start in parallel, and the startup gate blocks readiness until every component reports healthy

 Load Config  -->  Build Scheme  -->  Create Manager  -->  Init Services
                                                              |
                                                              v
                                                    Translator + Status + Reconciler
                                                              |
                                                              v
                                                    Syncer + Admin + gRPC
                                                              |
                                                              v
                                                    Supervisor (start all, gate readiness)

Translator

The translator lives at gateway/internal/translator/translator.go. Its primary method is Build(ctx, client), which reads all relevant Kubernetes resources through the controller-runtime client and produces a complete ir.Snapshot.

What it reads

The translator fetches these Kubernetes resource types in a single build cycle:

Gateway API resources: Gateway, ListenerSet, HTTPRoute, GRPCRoute, TCPRoute, UDPRoute, TLSRoute, ReferenceGrant, BackendTLSPolicy
Nantian custom resources: AIService, TokenPolicy, WasmPlugin, BackendLBPolicy
Kubernetes primitives: Service, Endpoints, EndpointSlice, Secret, ConfigMap, Namespace, Node
Multi-cluster: ServiceImport (from the MCS API)

How it builds

The translator follows a pipeline that processes resources in dependency order:

Gateways and listeners are read first. They define the ports, protocols, and TLS configuration that the data plane listens on
Routes are matched to gateways. Each route type (HTTPRoute, GRPCRoute, TCPRoute, UDPRoute, TLSRoute) is processed by its own translation function
Backend references are resolved. The translator follows backendRefs from routes to Services, EndpointSlices, and custom backends
Policies are applied. BackendTLSPolicy, BackendLBPolicy, TokenPolicy, and WasmPlugin are attached to the relevant route or listener
AI service configuration is translated. AIService resources are resolved into backend clusters with provider-specific configuration
Limits are enforced. The translator checks MaxInputObjects, MaxSnapshotObjects, and MaxSnapshotEndpoints and aborts if any limit is exceeded

Resource limits

The translator accepts configurable limits to prevent resource exhaustion:

Limit	Default	Description
`MaxInputObjects`	0 (unlimited)	Maximum total Kubernetes objects read during a build
`MaxSnapshotObjects`	0 (unlimited)	Maximum objects in the output IR snapshot
`MaxSnapshotEndpoints`	0 (unlimited)	Maximum endpoint entries in the output snapshot

When a limit is exceeded, the translator returns an error and the build is considered failed. The snapshot store retains the last successful snapshot, so data planes continue serving traffic with the previous configuration.
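
The limit semantics can be illustrated with a minimal sketch. It assumes that 0 disables a limit, as the defaults table states; the `Limits` struct, the `Enforce` method, and the `ErrLimitExceeded` sentinel are hypothetical names for illustration, not the translator's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// Limits mirrors the three documented translator limits.
// A value of 0 means "unlimited", matching the documented defaults.
type Limits struct {
	MaxInputObjects      int
	MaxSnapshotObjects   int
	MaxSnapshotEndpoints int
}

// ErrLimitExceeded is a hypothetical sentinel error used only in this sketch.
var ErrLimitExceeded = errors.New("translator limit exceeded")

// check fails only when the limit is set (non-zero) and count is above it.
func check(name string, count, limit int) error {
	if limit > 0 && count > limit {
		return fmt.Errorf("%w: %s: got %d, limit %d", ErrLimitExceeded, name, count, limit)
	}
	return nil
}

// Enforce validates the object counts of one build against the limits.
func (l Limits) Enforce(inputObjects, snapshotObjects, snapshotEndpoints int) error {
	if err := check("MaxInputObjects", inputObjects, l.MaxInputObjects); err != nil {
		return err
	}
	if err := check("MaxSnapshotObjects", snapshotObjects, l.MaxSnapshotObjects); err != nil {
		return err
	}
	return check("MaxSnapshotEndpoints", snapshotEndpoints, l.MaxSnapshotEndpoints)
}

func main() {
	limits := Limits{MaxSnapshotEndpoints: 1000} // other limits stay unlimited
	fmt.Println(limits.Enforce(50, 20, 999))  // within limits
	fmt.Println(limits.Enforce(50, 20, 1001)) // exceeds MaxSnapshotEndpoints
}
```

A caller that receives the error would skip publishing, which is what lets the snapshot store keep the last successful snapshot.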

Indexes

The translator registers several field indexes with the controller-runtime manager to speed up resource lookups. These are defined in gateway/internal/translator/indexes.go and registered via SetupIndexes() during startup. Without these indexes, the translator would need to perform expensive full-list scans for every build.

Snapshot store

The snapshot store (gateway/internal/ir/store.go) is the central distribution point for translated configuration. It holds the current ir.Snapshot and manages a subscriber list. Each subscriber represents a data plane that needs to receive configuration updates.

When the translator publishes a new snapshot, the store replaces the current snapshot and fans it out to all subscribers. If a subscriber has not yet picked up an earlier pending snapshot, the store coalesces: it overwrites the pending snapshot with the newest one rather than queuing both. A slow data plane therefore skips intermediate snapshots instead of applying backpressure that would block the translator.

The store exposes hooks for metrics collection. The OnSubscriberQueueReplace hook increments the nantian_gateway_controlplane_xds_snapshot_fanout_coalesced_total counter when a pending snapshot is replaced.
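
The coalescing fan-out can be sketched as a single-slot mailbox per subscriber. This is an illustrative model, not the code in store.go: the `Store`, `Subscribe`, `Publish`, `Take`, and `OnReplace` names are assumptions, and `Snapshot` stands in for ir.Snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot stands in for ir.Snapshot; only a version is modeled here.
type Snapshot struct{ Version int }

// subscriber holds at most one pending snapshot (a single-slot mailbox).
type subscriber struct {
	pending *Snapshot
	notify  chan struct{} // capacity 1: signals that something is pending
}

// Store is a hypothetical, simplified snapshot store.
type Store struct {
	mu        sync.Mutex
	current   *Snapshot // latest published snapshot
	subs      map[string]*subscriber
	OnReplace func(nodeID string) // called when a pending snapshot is overwritten
}

func NewStore() *Store { return &Store{subs: make(map[string]*subscriber)} }

// Subscribe registers a node and returns its notification channel.
func (s *Store) Subscribe(nodeID string) <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	sub := &subscriber{notify: make(chan struct{}, 1)}
	s.subs[nodeID] = sub
	return sub.notify
}

// Publish never blocks on a slow subscriber: it overwrites the pending slot.
func (s *Store) Publish(snap *Snapshot) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current = snap
	for id, sub := range s.subs {
		if sub.pending != nil && s.OnReplace != nil {
			s.OnReplace(id) // coalesced: an undelivered snapshot is discarded
		}
		sub.pending = snap
		select {
		case sub.notify <- struct{}{}:
		default: // the subscriber has already been signalled
		}
	}
}

// Take removes and returns the pending snapshot for a node, or nil if none.
func (s *Store) Take(nodeID string) *Snapshot {
	s.mu.Lock()
	defer s.mu.Unlock()
	sub, ok := s.subs[nodeID]
	if !ok {
		return nil
	}
	snap := sub.pending
	sub.pending = nil
	return snap
}

func main() {
	store := NewStore()
	coalesced := 0
	store.OnReplace = func(string) { coalesced++ }

	ready := store.Subscribe("node-a")
	store.Publish(&Snapshot{Version: 1})
	store.Publish(&Snapshot{Version: 2}) // node-a has not consumed version 1 yet

	<-ready
	fmt.Println(store.Take("node-a").Version, coalesced) // prints "2 1"
}
```

In this model the hook plays the role of OnSubscriberQueueReplace: a metrics layer would increment the coalesced counter inside it.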

gRPC xDS server

The gRPC server (gateway/internal/grpcserver/server.go) implements the ConfigurationDiscoveryService gRPC service defined in proto/gateway/control/v1/. Data planes connect via a bidirectional streaming RPC and exchange DiscoveryRequest and DiscoveryResponse messages.

Stream lifecycle

A data plane opens a StreamConfiguration RPC and sends a DiscoveryRequest identifying itself by node ID
The server validates the request, records the stream in the active stream registry, and subscribes the node to the snapshot store
The server sends the current snapshot to the data plane
The data plane acknowledges with an ACK or NACK
When the snapshot store publishes a new snapshot, the server sends it to all active streams
The data plane sends status reports (DataplaneStatusReport) on the same stream, providing health and configuration state
The stream terminates when the data plane disconnects, the server shuts down, or a timeout occurs

Stream termination reasons

The server tracks why each stream ended and records it in the nantian_gateway_controlplane_xds_stream_terminations_total metric with the reason label:

Reason	Description
`shutdown`	Server is shutting down
`client_disconnect`	Data plane closed the stream
`stream_error`	gRPC stream encountered an error
`send_timeout`	Sending a snapshot timed out
`ack_timeout`	No ACK/NACK received within the timeout
`superseded`	A new stream from the same node replaced this one
`invalid_request`	The initial request was malformed
`other`	Any other reason
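
One way to derive the reason label is to classify the error that ended the stream. The sketch below is an assumption about how such a mapping could look: the sentinel errors are hypothetical, and treating io.EOF, a cancelled context, or a nil error as a client disconnect is a simplification rather than the server's documented behavior.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// Hypothetical sentinel errors; the real server may signal these differently.
var (
	errSendTimeout    = errors.New("send timeout")
	errAckTimeout     = errors.New("ack timeout")
	errSuperseded     = errors.New("superseded by a newer stream")
	errInvalidRequest = errors.New("invalid initial request")
	errTransport      = errors.New("transport failure")
)

// terminationReason maps the cause of a stream ending to a metric label value.
func terminationReason(err error, serverStopping bool) string {
	switch {
	case serverStopping:
		return "shutdown"
	case err == nil, errors.Is(err, io.EOF), errors.Is(err, context.Canceled):
		return "client_disconnect"
	case errors.Is(err, errSendTimeout):
		return "send_timeout"
	case errors.Is(err, errAckTimeout):
		return "ack_timeout"
	case errors.Is(err, errSuperseded):
		return "superseded"
	case errors.Is(err, errInvalidRequest):
		return "invalid_request"
	case errors.Is(err, errTransport):
		return "stream_error"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(terminationReason(io.EOF, false))
	fmt.Println(terminationReason(fmt.Errorf("node-a: %w", errAckTimeout), false))
	fmt.Println(terminationReason(errors.New("unexpected"), false))
}
```

Keeping the label set closed, with `other` as the fallback, bounds the cardinality of the metric.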

Status report handling

The gRPC server receives DataplaneStatusReport messages from data planes and forwards them to the node status registry. Reports are validated before being applied. Rejected reports are counted in the nantian_gateway_controlplane_xds_status_report_rejections_total metric with rejection reasons like shutdown, invalid_request, unknown_node, or other.

Reconciler runner

The control plane uses a custom reconciler runner (gateway/internal/controller/reconciler_runner.go) rather than the default controller-runtime reconciler loop. The runner supports:

Scoped reconciliation: Infrastructure and status are reconciled independently with different scopes
Settle delay: Changes are debounced to avoid excessive reconciliation during rapid resource updates
Retry with backoff: Failed reconciliations are retried with exponential backoff
Immediate trigger: Node status changes can trigger immediate infrastructure reconciliation

The runner emits detailed metrics for monitoring: queue depth, trigger counts, deduplication counts, settle state, and retry state.
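
The settle delay and retry backoff can be sketched with two small helpers. The doubling schedule, the cap, and the `settler` type are illustrative assumptions; the runner's actual parameters and structure are not specified here.

```go
package main

import (
	"fmt"
	"time"
)

// backoff returns the delay before retry number attempt (0-based): it starts
// at base, doubles on every attempt, and never exceeds ceiling.
func backoff(attempt int, base, ceiling time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= ceiling/2 {
			return ceiling
		}
		d *= 2
	}
	if d > ceiling {
		return ceiling
	}
	return d
}

// settler debounces triggers: a reconcile is due only after no new trigger
// has arrived for the full settle delay.
type settler struct {
	delay time.Duration
	last  time.Time
	dirty bool
}

// Trigger records a change and restarts the settle window.
func (s *settler) Trigger(now time.Time) { s.last, s.dirty = now, true }

// Due reports whether a pending change has been quiet for the settle delay.
func (s *settler) Due(now time.Time) bool { return s.dirty && now.Sub(s.last) >= s.delay }

// Done marks the pending change as reconciled.
func (s *settler) Done() { s.dirty = false }

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoff(attempt, time.Second, 30*time.Second))
	}

	s := &settler{delay: 200 * time.Millisecond}
	t0 := time.Now()
	s.Trigger(t0)
	s.Trigger(t0.Add(150 * time.Millisecond))          // a second change restarts the window
	fmt.Println(s.Due(t0.Add(300 * time.Millisecond))) // false: only 150ms of quiet
	fmt.Println(s.Due(t0.Add(350 * time.Millisecond))) // true: 200ms of quiet
}
```

An immediate trigger, such as the node status case above, would simply bypass the `Due` check and reconcile right away.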

Leader election

The control plane uses Kubernetes leader election through the controller-runtime manager. The leader election configuration is defined in the control plane config:

Parameter	Default	Description
`leaderElection.enabled`	`true`	Enable leader election
`leaderElection.id`	`nantian-gw-controlplane`	Lease identity
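`leaderElection.leaseDuration`	`15s`	How long standbys wait after the last observed renewal before trying to take over the lease
`leaderElection.renewDeadline`	`10s`	How long the leader keeps retrying to renew the lease before giving up leadership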
`leaderElection.retryPeriod`	`2s`	How long candidates wait between acquisition attempts

Only the leader runs the translator, reconciler runner, and snapshot syncer. Standby replicas serve the Admin API and the metrics endpoint but do not watch Kubernetes resources or build snapshots. If the leader fails, a standby acquires the lease once it expires and takes over translation; with the defaults above, this takes roughly one lease duration plus up to one retry period.
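
The timing parameters constrain one another, which a small validation sketch makes explicit. The struct and method names are hypothetical, the ordering checks are a simplified version of what Kubernetes leader election generally expects (treat the exact rules as an assumption), and the failover figure is a rough estimate rather than a guarantee.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LeaderElection mirrors the documented leaderElection.* parameters.
type LeaderElection struct {
	Enabled       bool
	ID            string
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// Defaults returns the documented default values.
func Defaults() LeaderElection {
	return LeaderElection{
		Enabled:       true,
		ID:            "nantian-gw-controlplane",
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
	}
}

// Validate checks the relative ordering of the timing parameters.
func (c LeaderElection) Validate() error {
	switch {
	case !c.Enabled:
		return nil
	case c.ID == "":
		return errors.New("leaderElection.id must not be empty")
	case c.LeaseDuration <= c.RenewDeadline:
		return errors.New("leaseDuration must be greater than renewDeadline")
	case c.RenewDeadline <= c.RetryPeriod:
		return errors.New("renewDeadline must be greater than retryPeriod")
	}
	return nil
}

// ApproxFailover is a rough estimate of how long a standby may need to take
// over after the leader disappears: lease expiry plus one retry interval.
func (c LeaderElection) ApproxFailover() time.Duration {
	return c.LeaseDuration + c.RetryPeriod
}

func main() {
	cfg := Defaults()
	fmt.Println(cfg.Validate(), cfg.ApproxFailover())

	cfg.RenewDeadline = 20 * time.Second // longer than the lease itself
	fmt.Println(cfg.Validate())
}
```

Shortening leaseDuration speeds up failover but leaves less room for a healthy leader to renew through a brief API server slowdown.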

Status reporting

The status writer (gateway/internal/status/) writes status back to Gateway API resources. It updates:

Gateway status: Listener status (ready, warning, failed), addresses, and conditions
Route status: Parent gateway acceptance, route conditions, and resolved refs
Policy status: Ancestor references and conditions for BackendTLSPolicy, BackendLBPolicy, AIService, TokenPolicy, and WasmPlugin

The status writer is triggered by the reconciler runner after each infrastructure reconciliation. It uses the controller-runtime client to patch status subresources, respecting the standard Gateway API status conventions.

Node status

The node status system (gateway/internal/nodestatus/) tracks the health and configuration state of each data plane instance. It uses Kubernetes Lease objects for persistence, storing node status as JSON in the lease’s spec fields. The node status registry supports:

Debounced persistence: Status updates are batched and flushed after a configurable debounce window
Bounded backlog: The persistence queue has a configurable maximum depth to prevent unbounded memory growth
Immediate and debounced updates: Critical updates can bypass the debounce window

Node status metrics include queue depth, pending nodes, enqueued/dropped totals, and flush duration histograms.
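
A bounded, coalescing persistence queue could look like the following sketch. It assumes that repeated updates for the same node overwrite each other and that updates for new nodes are dropped once the backlog is full; `StatusQueue` and its methods are hypothetical names, and the Lease write is represented by a callback.

```go
package main

import (
	"fmt"
	"sync"
)

// StatusQueue is a hypothetical bounded backlog of node status updates.
type StatusQueue struct {
	mu       sync.Mutex
	maxDepth int
	pending  map[string]string // node ID -> latest status JSON
	order    []string          // node IDs in first-enqueue order
	Enqueued int               // accepted updates
	Dropped  int               // updates rejected because the backlog was full
}

func NewStatusQueue(maxDepth int) *StatusQueue {
	return &StatusQueue{maxDepth: maxDepth, pending: make(map[string]string)}
}

// Enqueue records the latest status for a node. It reports false when the
// update is dropped because the backlog has reached maxDepth.
func (q *StatusQueue) Enqueue(nodeID, statusJSON string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, exists := q.pending[nodeID]; !exists {
		if len(q.pending) >= q.maxDepth {
			q.Dropped++
			return false
		}
		q.order = append(q.order, nodeID)
	}
	q.pending[nodeID] = statusJSON // a newer status replaces the older one
	q.Enqueued++
	return true
}

// Flush hands every pending status to persist and empties the backlog.
// A debounce timer would call this after the window elapses; a critical
// update could call it directly to bypass the window.
func (q *StatusQueue) Flush(persist func(nodeID, statusJSON string)) int {
	q.mu.Lock()
	order, pending := q.order, q.pending
	q.order, q.pending = nil, make(map[string]string)
	q.mu.Unlock()

	for _, id := range order {
		persist(id, pending[id]) // for example, a write to the node's Lease
	}
	return len(order)
}

func main() {
	q := NewStatusQueue(2)
	q.Enqueue("node-a", `{"healthy":true}`)
	q.Enqueue("node-a", `{"healthy":false}`) // coalesced with the entry above
	q.Enqueue("node-b", `{"healthy":true}`)
	q.Enqueue("node-c", `{"healthy":true}`) // dropped: the backlog is full

	flushed := q.Flush(func(id, s string) { fmt.Println(id, s) })
	fmt.Println(flushed, q.Enqueued, q.Dropped)
}
```

The persist callback runs outside the lock so that a slow write does not block new status updates.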

Lifecycle supervisor

The lifecycle supervisor (gateway/internal/lifecycle/supervisor.go) manages the startup and shutdown of all control plane components. Components are registered with a name and a run function. The supervisor:

Starts all components in parallel
Waits for all components to signal readiness (or for a startup timeout)
Marks the startup gate as ready, allowing Kubernetes readiness probes to succeed
On shutdown, cancels the context for all components and waits for graceful termination

Components include the controller-runtime manager, admin HTTP server, metrics HTTP server, gRPC server, and optionally the pprof debug server.
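
The supervisor behavior listed above can be sketched as follows. This is a simplified model under stated assumptions: the `Component` and `Supervisor` types, the `ready` callback, and the `Gate` channel are hypothetical, and a startup timeout is treated here as a startup failure, which the real supervisor may handle differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Component is a named unit of work. Run must call ready once it can serve
// and should return when ctx is cancelled.
type Component struct {
	Name string
	Run  func(ctx context.Context, ready func()) error
}

// Supervisor is a hypothetical, simplified lifecycle supervisor.
type Supervisor struct {
	components []Component
	gate       chan struct{} // closed once every component is ready
}

func NewSupervisor(components ...Component) *Supervisor {
	return &Supervisor{components: components, gate: make(chan struct{})}
}

// Gate is closed when startup completes; a readiness probe can check it.
func (s *Supervisor) Gate() <-chan struct{} { return s.gate }

// Run starts all components in parallel, opens the gate when all are ready,
// and on exit cancels every component and waits for it to return.
func (s *Supervisor) Run(ctx context.Context, startupTimeout time.Duration) error {
	ctx, cancel := context.WithCancel(ctx)
	var wg sync.WaitGroup
	stop := func(err error) error {
		cancel()
		wg.Wait() // graceful termination: wait for every component
		return err
	}
	n := len(s.components)
	readyCh, errCh := make(chan struct{}, n), make(chan error, n)
	for _, c := range s.components {
		c, once := c, new(sync.Once)
		wg.Add(1)
		go func() {
			defer wg.Done()
			ready := func() { once.Do(func() { readyCh <- struct{}{} }) }
			if err := c.Run(ctx, ready); err != nil {
				errCh <- fmt.Errorf("%s: %w", c.Name, err)
			}
		}()
	}
	timeout := time.After(startupTimeout)
	for remaining := n; remaining > 0; {
		select {
		case <-readyCh:
			remaining--
		case err := <-errCh:
			return stop(err)
		case <-timeout:
			return stop(errors.New("startup timed out"))
		case <-ctx.Done():
			return stop(ctx.Err())
		}
	}
	close(s.gate) // readiness probes may now succeed
	select {
	case err := <-errCh:
		return stop(err)
	case <-ctx.Done():
		return stop(nil)
	}
}

// component builds a demo component that optionally signals readiness.
func component(name string, signalsReady bool) Component {
	return Component{Name: name, Run: func(ctx context.Context, ready func()) error {
		if signalsReady {
			ready()
		}
		<-ctx.Done() // serve until shutdown
		return nil
	}}
}

// runOnce runs a supervisor and shuts it down as soon as the gate opens.
func runOnce(timeout time.Duration, comps ...Component) (bool, error) {
	s := NewSupervisor(comps...)
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	done := make(chan error, 1)
	go func() { done <- s.Run(ctx, timeout) }()
	select {
	case <-s.Gate():
		cancel() // simulate a shutdown signal
		return true, <-done
	case err := <-done:
		return false, err
	}
}

func main() {
	fmt.Println(runOnce(time.Second, component("grpc", true), component("admin", true)))
	fmt.Println(runOnce(50*time.Millisecond, component("stuck", false)))
}
```

Counting readiness signals on a buffered channel, rather than waiting on a WaitGroup in a helper goroutine, avoids leaking that goroutine when startup fails.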

What’s next

Data Plane Design — how the Rust proxy handles traffic
Admin API — API reference for the admin server
Configuration: Control Plane — all configuration parameters