The Enterprise Path for AI Inference on Kubernetes After WG Serving

2026-02-27

Summary

On February 26, 2026, the Kubernetes Working Group Serving (WG Serving) announced its dissolution after achieving its primary goal: establishing Kubernetes as the orchestration platform of choice for inference workloads. For enterprise architects, this is more than an internal CNCF governance change: it signals that the stack is mature enough to standardize on in large-scale production environments.

From Experimentation to Control Plane

WG Serving was formed roughly two years ago to address a concrete need: Kubernetes already handled training and fine-tuning, but production inference had different requirements, including low latency, context-aware routing, accelerator management, and autoscaling driven by tokens per second rather than HTTP request counts. The working group gathered requirements from hardware providers (NVIDIA, AMD), model server projects (vLLM, TensorRT-LLM, Triton), and cloud operators, building a shared understanding of the problem.

The result isn’t a single monolithic component, but a set of composable Kubernetes primitives:

Gateway API Inference Extension (sponsored by SIG Network): introduces InferencePool and InferenceModel as custom resource types. Unlike a traditional load balancer that distributes requests round-robin, this gateway is aware of the model server's internal state (number of in-flight requests, KV cache utilization) and can perform prefix-cache-aware routing, sending requests that share a prompt prefix to the same replica to maximize the cache hit rate.
LeaderWorkerSet (LWS) (SIG Apps): solves the problem of multi-node, multi-GPU deployment. A 70-billion-parameter model needs about 140 GB for its FP16 weights alone, so it does not fit on a single 80 GB A100, and larger models outgrow a single node. LWS manages the group of nodes as a cohesive unit, with coordinated restarts and well-defined failure domains.
Dynamic Resource Allocation (DRA) (WG Device Management): replaces the rigid count-based request model (nvidia.com/gpu: 1) with a structured mechanism for requesting accelerator classes and sharing arrangements such as time-slicing and MIG. In a cluster that mixes A100, H100, and H200 GPUs, DRA lets the scheduler match a model's requirements against the attributes of the available hardware.
Kueue: job queueing with per-namespace quotas, priorities, and fair sharing between teams. Essential in multi-tenant environments where several data science teams share a cluster.
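
To make the routing idea in the Gateway API Inference Extension item more concrete, here is a minimal Python sketch of prefix-aware replica selection. It is not the extension's actual scheduler: the Replica class, the pick_replica function, the block size, and the load penalty are hypothetical names and values chosen for illustration. Prompts are split into fixed-size token blocks, each block is hashed together with its predecessor, and the request goes to the replica that already holds the longest matching run of blocks, minus a penalty for queued requests.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block (illustrative value only)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block chained with the previous hash.

    Because every hash includes its predecessor, two prompts share the
    first k hashes only if their first k blocks are identical.
    """
    hashes, prev = [], ""
    for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = tokens[start:start + BLOCK_SIZE]
        payload = prev + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    name: str
    queue_depth: int = 0                            # in-flight requests on this replica
    cached: set[str] = field(default_factory=set)   # block hashes assumed to be in its KV cache


def matched_blocks(replica: Replica, hashes: list[str]) -> int:
    """Count leading blocks already cached, stopping at the first miss."""
    count = 0
    for h in hashes:
        if h not in replica.cached:
            break
        count += 1
    return count


def pick_replica(replicas: list[Replica], tokens: list[int],
                 load_penalty: float = 0.5) -> Replica:
    """Pick the replica with the best (cached prefix blocks - load penalty) score."""
    hashes = block_hashes(tokens)
    best = max(replicas,
               key=lambda r: matched_blocks(r, hashes) - load_penalty * r.queue_depth)
    best.cached.update(hashes)  # assumption: the chosen replica caches this prompt
    best.queue_depth += 1
    return best
```

With two idle replicas, a first request carrying a 16-token shared prefix goes to the first replica in the list. A second request with the same prefix then scores higher there (four cached blocks minus the load penalty) than on the empty replica. The penalty term keeps a heavily loaded replica from attracting all traffic just because its cache is warm.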

The Distributed Layer: llm-d and AIBrix

WG Serving recognized that the hardest problems (disaggregated prefill and decode, systematic benchmarking, co-evolution of model servers and infrastructure) require specialized forums outside core Kubernetes.

The llm-d project (Linux Foundation) bridges the infrastructure and ML ecosystems, providing architectural patterns for disaggregating prefill and decode onto separate nodes. In production, this allows the two phases to be tuned separately: prefill is compute-bound and is optimized for throughput, while decode is memory-bandwidth-bound and is optimized for per-token latency. The gains are measurable on workloads with long prompts, which are typical of enterprise RAG workflows.
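
A rough roofline-style estimate shows why the two phases stress different resources. The sketch below is a simplification under stated assumptions: it counts only the dense weight matrices in FP16, ignores attention and KV cache traffic, and uses made-up accelerator figures (PEAK_FLOPS, PEAK_BW). None of the names or numbers come from llm-d.

```python
PEAK_FLOPS = 300e12   # assumed peak FP16 throughput in FLOP/s (hypothetical accelerator)
PEAK_BW = 2e12        # assumed memory bandwidth in bytes/s (hypothetical accelerator)
BYTES_PER_PARAM = 2   # FP16 weights


def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Each parameter contributes about 2 FLOPs (multiply and add) per token,
    while the weights are read from memory once per pass no matter how
    many tokens are processed together.
    """
    flops_per_param = 2 * tokens_per_pass
    return flops_per_param / BYTES_PER_PARAM


def bound(tokens_per_pass: int) -> str:
    """Classify a pass by comparing its intensity with the ridge point."""
    ridge = PEAK_FLOPS / PEAK_BW  # FLOPs per byte where compute and bandwidth balance
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bandwidth-bound"
```

With these assumed figures the ridge point is 150 FLOPs per byte. A prefill pass over a 4096-token prompt lands far above it, while a decode step that produces one token per sequence lands far below unless many sequences are batched together. That asymmetry is what makes it attractive to size and schedule the two phases separately.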

AIBrix offers a complete platform for cost-efficient LLM serving, integrating autoscaling based on token-native metrics, NUMA-aware scheduling, and replica placement optimized for prefix caching.
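
As an illustration of what autoscaling on a token-native metric can look like, the sketch below applies the proportional rule used by the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current replicas × current metric / target metric), to an average tokens-per-second figure. It is not AIBrix's algorithm; the function name, the tolerance band, and the replica bounds are assumptions made for this example.

```python
import math


def desired_replicas(current_replicas: int,
                     tokens_per_sec_per_replica: float,
                     target_tokens_per_sec: float,
                     min_replicas: int = 1,
                     max_replicas: int = 16,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling on average generated tokens/s per replica."""
    ratio = tokens_per_sec_per_replica / target_tokens_per_sec
    if abs(ratio - 1.0) <= tolerance:
        # Inside the dead band: keep the current size to avoid flapping.
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, four replicas averaging 1500 tokens/s each against a 1000 tokens/s target yield six replicas, while an average of 1050 tokens/s stays inside the tolerance band and leaves the deployment unchanged. A production autoscaler would likely combine throughput with queue length or KV cache pressure, since throughput alone flattens out once replicas saturate.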

Practical Implications for an Enterprise Architect

For a team building an enterprise-grade AI inference platform on Kubernetes, the reference stack is now clear:

Kubernetes v1.35+ with DRA enabled and AI conformance profiles (k8s-ai-conformance)
Gateway API Inference Extension for intelligent application-level routing
LWS for models requiring multiple nodes
Kueue for multi-tenancy and resource governance
llm-d or AIBrix as the serving layer above the cluster
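
A quick sizing check can help decide whether a given model falls under the LWS item in the list above. The sketch below counts only weight memory (parameters × bytes per parameter) and reserves a flat fraction of each GPU for KV cache and activations. The 80 GB GPU size, the 8 GPUs per node, and the 30% headroom are illustrative assumptions, not recommendations, and the function name is hypothetical.

```python
import math


def placement(params_billion: float,
              bytes_per_param: int = 2,      # FP16/BF16 weights
              gpu_mem_gb: float = 80.0,      # assumed memory per GPU
              gpus_per_node: int = 8,        # assumed node shape
              headroom: float = 0.3) -> dict:
    """Estimate how many GPUs and nodes are needed to hold the weights."""
    # 1e9 parameters times bytes per parameter, divided by 1e9 bytes per GB.
    weights_gb = params_billion * bytes_per_param
    # Memory left for weights on each GPU after reserving headroom.
    usable_gb = gpu_mem_gb * (1.0 - headroom)
    gpus = math.ceil(weights_gb / usable_gb)
    nodes = math.ceil(gpus / gpus_per_node)
    return {"weights_gb": weights_gb, "gpus": gpus,
            "nodes": nodes, "multi_node": nodes > 1}
```

Under these assumptions a 70B model needs 140 GB for weights and three GPUs, so it still fits on one node, whereas a 405B model spills onto a second node and becomes a candidate for LWS. Real deployments usually round the GPU count up to a tensor-parallel degree the model server supports, which this sketch does not attempt.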

In a banking context, where P99 latency and tenant isolation are non-negotiable constraints, this stack makes it possible to run internal models on GKE on-prem or bare-metal clusters with the same level of control applied to any other critical workload. Compared with managed solutions such as Vertex AI, the difference is full control over data residency and the software dependency chain, which matters for audits and regulatory compliance.

Conclusion

The dissolution of WG Serving is a sign of maturity, not abandonment. The work continues in the appropriate SIGs, with distributed governance and stable components. For enterprise architects, the message is that the Kubernetes stack for AI inference is no longer experimental: it is ready for production workloads, with a clear path toward certified CNCF conformance and a set of projects (Gateway API Inference Extension, LWS, DRA, Kueue, llm-d) that cover every layer of the stack, from the accelerator to HTTP routing.