pacoxu — GeistHaus

One Diagram for Model Distribution: Hugging Face, MatrixHub, Harbor, Dragonfly, ModelPack, and ModelExpress

22号_马修 Apr 28, 2026 Updated Apr 28, 2026

This page puts several frequently mixed-up projects on a single diagram. The goal is to separate the model source, private registry, cluster distribution, and runtime acceleration layers.
MatrixHub has not had an official release yet, but you can try v0.0.2-rc.7. This post is a preview of MatrixHub and a comparison with solutions such as Dragonfly + ModelPack + Harbor and Dynamo ModelExpress.

The Stack in One Diagram

Read the Diagram by Role

See https://github.com/pacoxu/AI-Infra/blob/main/docs/inference/model-distribution-stack.md#the-stack-in-one-diagram for technical details.

Provider / server view: The blue lane is the Docker image / OCI artifact path. Harbor is easiest to read here as a local Docker Hub / Distribution style private registry. The orange lane is the model distribution path, with Hugging Face, ModelScope, and MatrixHub on that side.
Download view: MatrixHub exposes an HF-compatible pull path. Dragonfly handles node-level file distribution and can serve OCI pulls from Harbor as well as hf:// and modelscope:// downloads.
End user / runtime view: Model files first land in node-local caches, then feed GPU workers. ModelExpress sits later in the path and accelerates weight reuse between workers, including cross-node GPU transfers over RDMA.

Line colors also carry meaning:

Orange links: HF-compatible or public model hub download paths
Blue links: OCI pull paths
Grey node-to-node links: Dragonfly node-level file chunk propagation
Green GPU-to-GPU links: runtime weight sharing paths relevant to ModelExpress

Focused Reference Diagrams

1. Dragonfly path: Harbor plus public model hubs

2. MatrixHub path: private Hugging Face style access

3. ModelExpress path: runtime weight sharing after initial pull (I am not very familiar with this part; corrections are welcome)

Read the Diagram from Left to Right

1. Hugging Face

Hugging Face is the public upstream model hub. It is the default source for many training and inference workflows using huggingface_hub, transformers, vLLM, and similar clients.

2. Private Hugging Face

Private Hugging Face is a target state, not a single product. It means:

private model hosting
access control and governance
low-friction compatibility with existing HF-style workflows
predictable distribution inside enterprise or air-gapped environments

3. MatrixHub

MatrixHub is the most direct path to that target state in this stack. It acts as an HF-compatible private hub, so teams can keep the Hugging Face interaction model while moving to a governed internal endpoint.

In practice, MatrixHub is the layer for:

private model registry and lifecycle governance
transparent HF proxy behavior
on-demand caching from public Hugging Face
multi-region or air-gapped distribution workflows
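
To make "HF-compatible" concrete, here is a minimal sketch of how a Hugging Face style file URL is built and how only the endpoint changes when a private hub is used. The private hostname is hypothetical, and the assumption that MatrixHub serves the same `/resolve/` path layout is inferred from its HF-compatible positioning, not confirmed by this post.

```python
from urllib.parse import quote

PUBLIC_HF = "https://huggingface.co"
# Hypothetical internal hostname, used for illustration only.
PRIVATE_HUB = "https://matrixhub.example.internal"


def resolve_url(endpoint: str, repo_id: str, filename: str, revision: str = "main") -> str:
    """Build an HF-style download URL: <endpoint>/<repo>/resolve/<revision>/<file>."""
    base = endpoint.rstrip("/")
    # Keep "/" in repo ids ("org/name") and in nested file paths.
    repo = quote(repo_id, safe="/")
    rev = quote(revision, safe="")
    path = quote(filename, safe="/")
    return f"{base}/{repo}/resolve/{rev}/{path}"


public = resolve_url(PUBLIC_HF, "org/model", "config.json")
private = resolve_url(PRIVATE_HUB + "/", "org/model", "config.json")
print(public)
print(private)
```

Clients built on huggingface_hub read the HF_ENDPOINT environment variable, so under this assumption a team would mostly change that one setting rather than its code.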

4. ModelPack + Harbor + Dragonfly

This path is different. It is OCI-first, not HF-first.

ModelPack provides a packaging/spec path for OCI-based model artifacts.
Harbor provides the private OCI registry, including enterprise governance features such as RBAC, signing, replication, and retention. A useful mental model is to treat it as an enterprise-local Docker Hub / Distribution style system with stronger management features.
Dragonfly accelerates distribution from the registry to nodes using preheat and P2P transfer patterns.

This stack is a strong answer to private model artifact management, but it does not by itself provide a native Hugging Face-compatible endpoint.

5. ModelExpress

ModelExpress sits later in the path. It is not the primary model hub. Its main job is runtime weight movement and cold-start reduction inside the cluster.

That usually means:

coordinating cache usage in the inference cluster
reducing repeated model pulls and loads
enabling worker-to-worker transfer
accelerating the last mile from storage or cache toward serving workers

The official documentation focuses on in-cluster multi-node coordination rather than a global multi-cluster control plane.

The Most Common Architecture Patterns

Pattern A: Public Hugging Face

Use this when convenience matters more than control.

Clients -> Hugging Face

Tradeoff:

simplest workflow
least governance
repeated public downloads
weak fit for air-gapped or regulated environments

Pattern B: Private Hugging Face with MatrixHub

Use this when existing HF workflows should remain almost unchanged.

Clients -> MatrixHub -> Hugging Face or private storage

Tradeoff:

lowest migration cost for HF-first teams
strong fit for internal mirroring and governance
less aligned with OCI-first platform standardization than Harbor

Pattern C: Private Model Registry with Harbor + ModelPack + Dragonfly

Use this when the platform is already centered on OCI artifacts and Kubernetes.

Build/package -> ModelPack -> Harbor -> Dragonfly -> cluster nodes

Tradeoff:

strong standardization and enterprise controls
clean fit for OCI-native platform teams
more workflow translation if users expect native HF semantics

Pattern D: MatrixHub + ModelExpress

Use this when you need both private Hugging Face-style access and faster cluster runtime loading.

Clients -> MatrixHub -> cluster cache/source -> ModelExpress -> workers

Division of responsibility:

MatrixHub is the upstream system of record and governed distribution layer.
ModelExpress is the in-cluster runtime acceleration layer.

This is especially natural in multi-cluster environments where each cluster runs its own runtime acceleration path while a shared upstream model source keeps versions and access policies consistent.

Quick Positioning Table

Component | Primary layer | Best for
Hugging Face | Public upstream hub | Public model discovery and default client workflows
Private Hugging Face | Capability / target state | Internal HF-like experience
MatrixHub | Private model hub | HF-compatible internal distribution and governance
ModelPack | Packaging/spec | OCI-based model artifact definition
Harbor | Private registry | OCI artifact governance and replication
Dragonfly | Cluster distribution | Large-scale node-level pull acceleration
ModelExpress | Runtime acceleration | In-cluster cold-start and weight transfer optimization

Practical Rule of Thumb

If the question is “where should models live and be governed?”, think MatrixHub or Harbor.
If the question is “do we want HF-compatible developer experience or OCI-first artifact workflows?”, choose between MatrixHub and Harbor + ModelPack.
If the question is “how do we reduce cluster cold-start and repeated weight movement?”, think Dragonfly and ModelExpress.
If the question is “how do we keep HF-like access while improving last-mile runtime loading?”, combine MatrixHub with ModelExpress.
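
The rules of thumb above can be written down as a small lookup. This is only a sketch of the reasoning in this post: the function name and its three boolean inputs are made up for illustration, and a real decision would weigh many more factors.

```python
def suggest_components(hf_compatible: bool, oci_first: bool, runtime_acceleration: bool) -> list[str]:
    """Map three yes/no questions to the components discussed in this post."""
    components: list[str] = []
    if hf_compatible:
        # Keep the Hugging Face interaction model behind a governed endpoint.
        components.append("MatrixHub")
    if oci_first:
        # OCI-native packaging, registry, and node-level distribution.
        components += ["ModelPack", "Harbor", "Dragonfly"]
    if runtime_acceleration:
        # In-cluster cold-start and weight transfer optimization.
        components.append("ModelExpress")
    if not components:
        # No private hosting or acceleration needs: Pattern A.
        components.append("Hugging Face")
    return components


print(suggest_components(hf_compatible=True, oci_first=False, runtime_acceleration=True))
```

With both HF compatibility and runtime acceleration requested, the sketch returns the Pattern D combination of MatrixHub and ModelExpress.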

References

http://pacoxu.wordpress.com/?p=1581

Kubernetes Pod Startup Speed Optimization Guide

22号_马修 Jan 30, 2026 Updated Mar 3, 2026

Pod startup speed is often overlooked in cloud-native environments, yet it affects both system performance and cost. Consider a scenario where traffic suddenly surges and the autoscaling system needs to spin up new Pods quickly to handle the load. If each Pod takes tens of seconds or even minutes to become fully operational, requests arriving during the startup window will likely be dropped, degrading user experience. This is not only a performance issue but also a cost issue: capacity that has been provisioned but is not yet serving traffic is paid for every second it sits idle.

Why Pod Startup Speed Matters

Pod startup performance touches several critical concerns. First is the need for rapid scaling. When applications require autoscaling, startup speed determines how quickly the system can respond to traffic fluctuations. Second is resource efficiency. Faster startup means less wasted idle resources. Third is user experience, especially in serverless architectures where cold start latency directly impacts the application response time users perceive. Finally, from a cost perspective, reducing Pod startup time significantly lowers infrastructure expenses.

The Four Phases of Pod Startup

Understanding the various stages of Pod startup is essential to optimization efforts. The Pod startup process can be divided into four major phases: API Server processing, scheduling, pod startup on node, and application start-to-ready.

The API Server processing phase involves Pod object creation, validation, and persistence. During this phase, the control plane must handle the request, execute admission control policies, and write the Pod object to etcd. While typically fast, this process can become a bottleneck in high-concurrency scenarios.

The scheduling phase spans from Pod creation to being scheduled on a specific node. The scheduler must evaluate all available nodes and select the most suitable target. The duration depends on cluster size and scheduler configuration. In large-scale clusters, this can become a significant source of latency.

The node startup phase encompasses pulling images, creating containers, and starting processes on the selected node. This is usually the longest phase in the entire Pod startup process. It includes network image pulls, storage volume initialization, application startup, and health check completion.

The ready phase, while not strictly part of the “startup” process, affects how the system perceives Pod readiness. If health checks are misconfigured, a Pod might be running but considered unready, affecting overall startup time metrics.

Optimizing the API Server Processing Stage

At the API Server level, the focus is on improving throughput and reducing latency. A straightforward but effective optimization is adjusting the API Server’s concurrent request handling capacity. Increasing the --max-requests-inflight and --max-mutating-requests-inflight parameters allows the API Server to handle more Pod creation requests simultaneously.

Another crucial optimization is streamlining admission controllers. Some controllers might perform expensive operations, such as accessing external services or executing complex validations. Consider disabling unnecessary admission controllers or configuring them for maximum efficiency. Similarly, ensuring excellent etcd performance is vital, as the API Server ultimately must persist Pod objects to etcd.

Optimizing the Scheduling Phase

The scheduler’s performance directly impacts the time from Pod creation to scheduling. Leveraging various optimization techniques provided by the scheduling framework can accelerate this process. For instance, the pre-filter and filter phases can quickly eliminate unsuitable nodes, reducing the number of candidates for subsequent scoring phases.

Another key optimization involves judiciously using node affinity and Pod affinity rules while avoiding overly complex rules that increase scheduling latency. Additionally, for specific workloads, using priority and preemption features can ensure critical Pods are scheduled faster.

In large-scale clusters, consider running multiple schedulers to distribute the load. Kubernetes supports running several schedulers side by side, with each Pod selecting one through spec.schedulerName, which can raise overall scheduling throughput.

Optimizing the Node Startup Phase

This phase offers the most substantial optimization opportunities. First, image pulling is a major bottleneck. Image warming is a proven optimization strategy. Pre-pull commonly used images to nodes during startup or scheduled maintenance windows. When Pods actually launch, they won’t need to fetch images from remote registries, drastically reducing startup time.

There are many ways to improve image pull speed. These include best practices for reducing image size, configuring Kubelet to enable concurrent image pulls, and adopting P2P distribution (such as Dragonfly or Uber/Kraken) or lazy loading mechanisms (such as stargz-snapshotter and Nydus). How to accelerate Pod startup under large-scale concurrency has been discussed in the context of VKE’s practices using Dragonfly and Nydus.

Container runtime performance is also critical. Different runtimes (such as containerd and Docker) have different performance characteristics. containerd is generally considered more lightweight and efficient, especially at large scale. Regularly upgrading the runtime to the latest version can also bring performance improvements. Slow docker create / docker run behavior is very common: it is usually caused by too many containers or images on a node (which can be mitigated by scheduled cleanup jobs). The legacy devicemapper storage driver is significantly slower in scenarios with frequent container creation and deletion; overlayfs performs somewhat better. In addition, older Docker versions contain many minor bugs that can cause docker ps or docker run to hang.

Application startup time itself also deserves attention. Some applications perform a large amount of initialization work at startup, such as database migrations or cache warm-up. Making these tasks asynchronous or deferring them until after the application has started can significantly reduce perceived startup latency.

CPU throttling during startup is another common issue. CPU throttling in Kubernetes has been discussed in detail elsewhere; possible mitigations include increasing CPU limits, pinning CPUs, or working around the issue with VPA-based approaches. (For cold-start scenarios with VPA, see: After waiting six years, Kubernetes 1.35 finally reaches GA with in-place resizing, boosting Java startup speed by 70%!)

Init containers can be used to perform necessary setup before the main container starts. However, init containers run sequentially, so it is important to avoid excessive initialization steps. Only perform what is strictly necessary and parallelize where possible. If feasible, use PostStart hooks to avoid delaying the main container startup. Complete preparation as early as possible: can these tasks be done during image build time? Can they be handled by a DaemonSet or Pod on the node? All of these approaches help reduce the preparation time before the Pod’s main container starts.

Optimizing Observability and Health Checks

Health check configuration has a significant impact on startup time metrics. If a StartupProbe is configured, the initialDelaySeconds of the ReadinessProbe should be reduced.

If no StartupProbe is configured, an appropriate initialDelaySeconds must be set for the ReadinessProbe.

At the same time, health checks must remain sufficiently strict to ensure that unhealthy Pods are not mistakenly considered ready.

Startup probes allow applications enough time to complete initialization without being killed or restarted during the startup phase.
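
As a rough illustration of how much time a startup probe grants, the sketch below computes the approximate startup budget from the probe fields. The probe values are hypothetical, the defaults used are the Kubernetes defaults for periodSeconds and failureThreshold, and the calculation ignores timeoutSeconds, so treat it as an estimate.

```python
def startup_window_seconds(probe: dict) -> int:
    """Approximate time a startup probe allows before the container is restarted."""
    initial = probe.get("initialDelaySeconds", 0)
    period = probe.get("periodSeconds", 10)      # Kubernetes default
    failures = probe.get("failureThreshold", 3)  # Kubernetes default
    return initial + period * failures


# Hypothetical probe for a slow-starting application.
startup_probe = {"periodSeconds": 5, "failureThreshold": 30}
window = startup_window_seconds(startup_probe)
default_window = startup_window_seconds({})
print(window, default_window)
```

A short period with a high failure threshold gives a slow application a generous budget while still detecting readiness quickly once it is up.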

Checkpointing and Snapshots

The Kubernetes community is exploring the use of checkpointing techniques to accelerate Pod startup. Checkpointing allows the state of a running container to be saved and later restored quickly, thereby skipping the application’s normal startup process. This is particularly beneficial for applications with long startup times.

For example, CRIU (Checkpoint/Restore In Userspace) has been integrated into the container runtimes used by Kubernetes. By saving a container’s state at a certain point in time—including memory and filesystem state—it can be rapidly restored when needed, effectively enabling a “warm start.” This approach is especially promising for serverless computing and batch workloads.

Kubelet configuration

Kubernetes 1.27: updates on speeding up Pod startup

https://kubernetes.io/blog/2023/05/15/speed-up-pod-startup/

This post points out a common issue in versions prior to 1.27, where Pods could start slowly and node-side events such as volume mount failures would occur.

When a node hosts many Pods, especially during sudden scale-up or scale-down events, the kubelet must synchronize Pod state and prepare ConfigMaps, Secrets, and volumes. Doing this quickly requires high-bandwidth access to the kube-apiserver.

In versions prior to v1.27, the default value of kubeAPIQPS was 5 and kubeAPIBurst was 10. Starting from v1.27, to improve Pod startup performance, the kubelet increased these defaults to 50 and 100 respectively. It is worth noting that raising the kubelet API QPS limits is not the only factor contributing to the performance improvement.
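
To see why these defaults matter, here is a simplified token-bucket estimate of how long the kubelet's client-side rate limit would hold back a burst of API calls. The figure of 200 requests is a made-up example, and the model ignores server latency and token refill during the initial burst.

```python
def seconds_to_send(requests: int, qps: float, burst: int) -> float:
    """Token-bucket estimate: `burst` requests go out at once, the rest at `qps` per second."""
    remaining = max(0, requests - burst)
    return remaining / qps


REQUESTS = 200  # hypothetical number of API calls during a sudden scale-up

before = seconds_to_send(REQUESTS, qps=5, burst=10)    # defaults before v1.27
after = seconds_to_send(REQUESTS, qps=50, burst=100)   # defaults from v1.27
print(before, after)
```

Under these assumptions the same burst drains in about 2 seconds instead of about 38, which matches the direction of the improvement described in the post.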

Comprehensive Optimization Strategy

Pod startup optimization isn’t an isolated effort but requires a systematic, layered approach. From the API Server to the scheduler, container runtime, and application layer, every level offers optimization opportunities.

Establishing clear Pod startup time metrics is an essential first step. Clearly defining what constitutes startup time (from Pod creation to container running, or to Pod readiness?) is important. Using Prometheus or other monitoring tools to collect detailed startup metrics helps identify where the real bottlenecks are.
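
One way to make the definition concrete is to derive latencies from the Pod object itself. The sketch below takes a Pod in the shape returned by `kubectl get pod -o json` and measures the time from creation to each standard condition. The sample Pod is invented, and lastTransitionTime reflects the most recent transition, so a Pod that flapped between ready and unready would report a later time than its first readiness.

```python
from datetime import datetime

TRACKED = ("PodScheduled", "Initialized", "ContainersReady", "Ready")


def parse_ts(value: str) -> datetime:
    """Parse a Kubernetes RFC 3339 timestamp such as 2026-01-30T10:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def startup_latencies(pod: dict) -> dict:
    """Seconds from Pod creation to each condition that is currently True."""
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    latencies = {}
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in TRACKED:
            reached = parse_ts(cond["lastTransitionTime"])
            latencies[cond["type"]] = (reached - created).total_seconds()
    return latencies


# Invented sample data for illustration.
sample_pod = {
    "metadata": {"creationTimestamp": "2026-01-30T10:00:00Z"},
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "True", "lastTransitionTime": "2026-01-30T10:00:01Z"},
            {"type": "ContainersReady", "status": "True", "lastTransitionTime": "2026-01-30T10:00:20Z"},
            {"type": "Ready", "status": "True", "lastTransitionTime": "2026-01-30T10:00:21Z"},
        ]
    },
}
latencies = startup_latencies(sample_pod)
print(latencies)
```

Comparing the PodScheduled and Ready values separates scheduling latency from node-side startup, which helps locate the real bottleneck.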

Priorities differ based on specific business needs and cluster characteristics. For high-traffic services requiring rapid scaling, image warming and startup probe tuning might yield the best results. For applications with long startup times, checkpoint technology might provide more value. For large-scale clusters, scheduler performance optimization and multiple scheduler instances might be key.

Finally, remember that optimization is a continuous process. Regularly reviewing and testing new optimization strategies, along with performance improvements from new Kubernetes versions, ensures your cluster maintains optimal performance.

Related Resources

[KubeCon China 2023] How Can Pod Start-up Be Accelerated on Nodes in Large Clusters? – Paco Xu, DaoCloud & Byron Wang https://www.youtube.com/watch?v=UfjSphSD1Uk&pp=2AYD

https://github.com/pacoxu/AI-Infra/blob/main/docs/kubernetes/pod-lifecycle.md

http://pacoxu.wordpress.com/?p=1566

2025 Year-End Review

22号_马修 Dec 25, 2025 Updated Dec 26, 2025

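Work (Community)

Community: Kubernetes and CNCF
- 1.33 – 1.35 (focusing on AI)
  - Restart Container/Pod policies, Gang Scheduling, DRA, Pod Level Resource Management, SLA-based scheduling (taint/toleration)
  - Node Readiness Controller
- Re-elected to the Steering Committee; thank you all for your support. See: Announcing the 2025 Steering Committee Election Results
- Why CNCF TAGs are the core of cloud native innovation (and where to find them at KubeCon Atlanta): I joined the new CNCF TAG Workloads Foundation. Its scope is actually very large, and it has no major output yet. Batch and Agentic topics get most of the discussion for now; the earlier work was mainly the scheduling white paper.
- On how the Kubernetes community is developing: my KCD talk mainly covered this, including several new AI-related working groups and new or AI-related projects such as Kueue, JobSet, GAIE, Agent Sandbox, and Kube Agentic Gateway.
- New attempts: my recent focus is inference. On one side I am learning vLLM (so far mostly the blog and Office Hours; I have not read much code yet), and I have also been following Ray and PyTorch. The other key topic is AI orchestration, which I summarized briefly in "How to choose the inference orchestration solution? AIBrix or Kthena or Dynamo?" There is no standard in this area yet, but because this layer is actually thin, the work is more about integrating with schedulers and the surrounding ecosystem than about standardization.
- I tried mentoring in the Open Source Promotion Plan (开源之夏, OSPP). I received three applications early on, which made me happy, and the applications were clearly written with care. The execution, however, was a little frustrating. The process itself was completed, but apart from submitting the application and submitting the code, there was almost no interaction the rest of the time. I understand that everyone is busy, but it still felt odd. I hope the program keeps getting better.

I largely agree with this diagram: support for AI/ML workloads has indeed given Kubernetes a chance to set sail again. Judging by current trends, however, the Kubernetes community mostly defines APIs and standard CRDs and provides simple best practices. Current examples are Gateway API, Gateway API Inference Extension, and the Gang API (Workloads). Overall, the room for exploration inside the Kubernetes community itself is still limited; more of the work has to co-evolve with upstream and downstream projects. This is very similar to how cgroups v2 updates in the kernel relate to the corresponding kubelet features.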

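Conferences
- The happiest part was that members of all five open-source teams came to London, which also counted as a "team building" trip (last year it was Hong Kong).
- Apart from the earlier Hong Kong keynote, this was the largest in-person audience I have had: a very large room was full. Many people probably came for vCluster, but overall it felt great. A Huge Cluster or Multi-Clusters? Identifying the Bottleneck – Paco Xu & Saiyam Pathak

Continued community outreach: Kubernetes New Contributor Orientation – Paco Xu, DaoCloud; ZhenYu Jiang & Mengjiao Liu, Independent

Helping more people learn what is new in Kubernetes and become more interested in participating: Keynotes-2, What's New in the Kubernetes Community: AI Gateway, Integration, and Conformance Working Groups Established
- My biggest takeaway from KCD Hangzhou is that Kubernetes seems to have a huge number of things to consider for AI, but it can also hand them off to other projects; it all depends on what it chooses.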

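WeChat official account:
- The posts on the official account fall roughly into several series:
  - Large clusters and multi-cluster solutions: this continued the KubeCon Europe theme, "a huge cluster or multi clusters". I translated the GKE 130k and 65k (last year) articles, translated an AWS article, and summarized the complete ByteDance KubeWharf stack (open-source version) as well as Ant Group's cluster architecture and optimizations. Multi-cluster itself was not covered much. Judging by read counts, this series attracts the most readers.
  - Isolation: mainly the AgentSandbox project, plus a brief look at how Kata, gVisor, and vArmor work.
  - AI: mainly vLLM and the related inference orchestration layer. I also looked at the Co-Evolving idea summarized by the Ray community. More and more projects are co-evolving, and the "PARK" framing (PyTorch + AI + Ray + Kubernetes) is quite interesting.
  - I posted for almost exactly one month, which basically cleared my backlog. Keeping a biweekly cadence in the future would already be good. The account gained 700 followers, far beyond my expectations, and I did not expect more than 1,000 shares.

Repo
- At the end of this year I reviewed the focus areas I had been tracking. I have also kept studying AI Infra in https://github.com/pacoxu/AI-Infra. Many of the official-account articles above came from what I accumulated while studying AI Infra. I also noticed that this definition of AI Infra does not seem to be widely accepted, especially by people who already work in AI; the concept here is closer to Cloud Native AI Infra.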

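Life

I did not travel much this year: Guangzhou, Qingdao, Suzhou, and Hangzhou (KCD), plus Hong Kong and London for KubeCon.

I really like Guangzhou's yum cha culture; I felt much more energetic there. The beaches in Qingdao are great, and Suzhou and Hangzhou are also good weekend destinations with a toddler.

Is two years old a bit early to start learning Go (weiqi)? Parenting has become the main theme, but staying patient and gentle with a child who keeps saying "no, no" is really hard, especially when we are in a hurry or something unexpected happens.

I feel like I read no books at all this year, yet I started to like the idea of reading the Harry Potter books with my daughter in the future.

I never got started on AI-generated picture books. My attempts in the first half of the year did not turn out well; the recent Gemini should be able to do what I have in mind.

Finally, some advice I give others that I should give myself: many things can be decided once you have collected complete information. The time spent collecting information can be long, but the time spent deciding needs to be limited, so that there is time left to carry the decision out. Last but not least, no regrets once the stone is placed. Stick with your choice patiently (reversing a decision is usually not cheap, though it is always possible, so persist to a certain point before talking about giving up). Agonizing over a choice only shows that you have a choice.

Looking back on many past experiences, I feel this diagram more and more keenly. Many things I achieved only because I took a little more initiative, and many opportunities I missed because I did not.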

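Hobbies

Football:
- I played 28 times this year, far more than last year. My overall form recovered, and my weight stayed at around 80 kg without climbing further, but my health is still not great: rhinitis and colds are still frequent. This also seems to be my worst year for sleep. I never had trouble falling asleep before, but this year I often wake up in the middle of the night and cannot get back to sleep. Also, since someone started filming our games, I have been playing more seriously than before.
- During the London KubeCon week I watched four Premier League matches, which was a bit crazy; the most memorable was Fulham 3:2 Liverpool. I also watched more live broadcasts than in previous years. Premier League kickoff times are friendlier and the games are more entertaining; the La Liga kickoff times I used to follow were brutal, and this year Valencia sold the center-back it had just developed and is struggling again. Liverpool is having a tough year. Some changes are slowly becoming visible, but now Isak is injured, and nothing seems to go right this season. Forever remembering No. 20, Jota. Salah's historical standing is actually very high, and I did not expect things to turn out this way. I hope he can be reborn after a tactical change, since it is a World Cup year. Last season his form seemed to disappear completely after Ramadan early in the year, and he still has not found it. I hope he regains some confidence with the national team while the club gets its tactics to at least a passing grade.

PUBG: I quit following it. The Chinese teams played too poorly this year, and there is less and less hope.
Go (weiqi): it takes a lot of training and patient calculation, and at my current pace of life it is hard to settle down and play. Of course, my level is also too low.
Podcasts: I cannot do anything else while driving, so I mostly pass the time with podcasts: a Liverpool fan listening to Arsenal fans chatting.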

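State of Mind

AI anxiety comes, on one hand, from how fast AI is developing: new technical articles and projects keep springing up. My learning speed cannot keep up with it, and I can feel the gap widening. AI can of course speed up learning. For the official account, I collect the material and screenshots (or generate some illustrations), give the AI the key links and content, and let it organize the wording, which lets me publish an article very quickly.

On the other hand, I see that for the AI giants, especially on the model side and in public cloud, many optimizations are amplified by scale. New models are basically monopolized by the giants, and not only in models: in other AI sub-fields, small players will become weaker and weaker. Especially after toolchains and workflows such as MCP and A2A arrived, Agent capabilities expanded a great deal, and higher "walls" will appear in more and more fields. Along the way, more and more jobs look replaceable. AI coding squeezes the space for junior programmers, but it also gives product managers and non-programmers more room for imagination, so perhaps this "broad programmer" market will grow. I hope AI creates more professions and new momentum rather than destroying more.

How do we adapt to the AI era? I have a rough idea: be an efficient Agent. An efficient Agent is simply an efficient human agent.

Routing: identify and decompose problems, quickly forward some of them to the right people, split tasks and prioritize them, and "outsource" part of the work.
- Avoid frequent context switching
- Produce output in stages so that you can quantify your own progress
A2A: work with other cost-effective Agents (the question is how to filter and label them). The Agent world should not depend on workflows and designated contacts; instead, everyone is a boss, and you "pay" other Agents to get things done. Agents that cannot find work are naturally eliminated or put on standby. "Profitable" Agents are published so that Agents running a deficit can learn from them. Perhaps money is not the right measure; efficiency, output quality, and energy consumption may reflect it better.
MCP: learn to use a range of AI tools
Reasoning: when interacting with other Agents, do not only take the result; try to understand their reasoning and implementation
Build a knowledge graph; collect and organize high-quality material
Exercise and develop interests: activate more "experts" in the brain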

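Outlook for 2026

Events: KCD Beijing + KCD Hangzhou (application in progress) + KubeCon China 2026 Shanghai

KubeCon Japan 2025 was very successful, and I would quite like to go.

What I have learned about AI Infra is still shallow. In particular, I have no hands-on experience with actual model knowledge, inference engines, or DRA, so this will probably be the focus for 2026.

My first three years of work felt like pure exploration, the next five were on the product side, and the most recent five have been mostly Kubernetes open source. Hit by this wave of AI, I seem to need some kind of transition, and under AI anxiety it is getting harder to position myself: model training, inference serving and orchestration, Agent Workflow, routing management, KV Cache, PD disaggregation, Gang scheduling, GPU management (utilization improvement), sandbox warm-up, hyperscale, cost optimization, multi-tenant isolation, observability. Footholds appear to be everywhere, but in practice I have only scratched the surface of many of them. Even so, next year I still plan to look at more new directions, and I need to participate more patiently in vLLM, SGLang, or AAIF.

In 2026 my child will start nursery and then kindergarten, which may be another new stage. Parenting really comes down to patience and time, nothing else. I do not want to be a pushy parent; I want to emphasize physical exercise and more time outdoors. My wife also collapsed from exhaustion several times this year while preparing open classes and other tasks, so we both need to take exercise seriously and build up our resistance.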

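Closing Thoughts: Perhaps Everyone Needs an Overall Review Every 3-5 Years

Whether in work or in life, we need periodic deep reviews.

My grandparents' generation was still farming (anxious about having enough to eat), my parents' generation mostly worked in factories and schools (anxious about instability), my own generation is mostly white-collar (anxious about competition), and the future may be all kinds of jobs brought by AI (huge uncertainty). In just a few decades, thinking and rules have not kept up with technological change. The planet, countries, and large enterprises are all like huge machines, and also like the Kubernetes community: after dozens of iterations or more, once the core has stabilized, this wave of AI turns out to require many large changes to adapt to. Each stage lasts 3-5 years, and what we are anxious about differs from stage to stage, but the anxiety itself seems hard to avoid.

Merry Christmas 🧑‍🎄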

http://pacoxu.wordpress.com/?p=1512

Agones: Kubernetes-Native Game Server Hosting

22号_马修 Dec 9, 2025 Updated Mar 3, 2026

Agones brings dedicated game server hosting to Kubernetes, enabling multiplayer gaming infrastructure with cloud-native scalability and management. This blog explores Agones as it applies to join CNCF Sandbox.

Introduction

As the gaming industry grows rapidly, the demand for scalable, reliable dedicated game server infrastructure has become critical. Agones is an open-source platform built on Kubernetes that addresses this need by providing a specialized solution for hosting, running, and scaling dedicated game servers.

Agones, derived from the Greek word “agōn” meaning “contest” or “competition at games”, transforms Kubernetes into a powerful platform for managing game server workloads with the same cloud-native principles used for traditional applications.

Project Status: Agones has applied to join the CNCF Sandbox (github.com/cncf/sandbox/issues/440), marking an important step in bringing gaming workloads into the cloud-native ecosystem.

What is Agones?

Agones is a library for hosting, running, and scaling dedicated game servers on Kubernetes. It replaces bespoke or proprietary cluster management solutions with Kubernetes-native APIs and controllers.

Core Concept: Dedicated game servers are stateful, ephemeral workloads that differ significantly from typical web applications. Each game session requires its own isolated server process, must maintain consistent network identity, and needs specialized lifecycle management. Agones extends Kubernetes to handle these unique requirements through Custom Resource Definitions (CRDs) and controllers.

Key Features

GameServer CRD: Define individual game servers declaratively using YAML or the Kubernetes API, complete with health checking and connection information
Fleet Management: Manage large groups of game servers as Fleets, similar to Kubernetes Deployments but optimized for game server workloads
Autoscaling: Native integration with Kubernetes cluster autoscaling, allowing Fleets to scale based on game server demand
Client SDKs: SDKs for multiple languages (Go, C#, C++, Rust, Node.js, REST) enabling game servers to communicate with the Agones control plane
Lifecycle Management: Automatic health checks, graceful shutdown handling, and state management for game server processes
Metrics and Observability: Game server-specific metrics exports and dashboards for operations teams

Architecture and Design

Agones extends Kubernetes with custom controllers and resources specifically designed for game server workloads:

Custom Resources

GameServer: Represents a single dedicated game server instance with health status, network ports, and connection information
Fleet: Manages groups of GameServers, providing replica management, rolling updates, and scaling capabilities
FleetAutoscaler: Automates Fleet scaling based on buffer policies, webhook policies, or counter/list-based policies
GameServerAllocation: Enables matchmakers to atomically allocate Ready GameServers from a Fleet for player connections
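
To illustrate the buffer policy idea behind FleetAutoscaler, here is a simplified model that keeps a fixed number of spare game servers on top of the allocated ones, clamped to minimum and maximum replica counts. The function and the numbers are illustrative assumptions; the real controller also supports percentage buffers and other policy types and may differ in detail.

```python
def buffer_target(allocated: int, buffer_size: int, min_replicas: int, max_replicas: int) -> int:
    """Desired Fleet size: allocated servers plus a spare buffer, within [min, max]."""
    desired = allocated + buffer_size
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical fleet: keep 5 spare servers, never fewer than 10 or more than 50 in total.
quiet = buffer_target(allocated=0, buffer_size=5, min_replicas=10, max_replicas=50)
busy = buffer_target(allocated=12, buffer_size=5, min_replicas=10, max_replicas=50)
full = buffer_target(allocated=48, buffer_size=5, min_replicas=10, max_replicas=50)
print(quiet, busy, full)
```

The buffer means a matchmaker usually finds a Ready server immediately, while the maximum caps cost during spikes.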

How It Works

Deployment: Operators define GameServers or Fleets using Kubernetes manifests
Lifecycle Management: Agones controllers create pods and manage their lifecycle based on game server state
Ready State: Game servers use the Agones SDK to mark themselves Ready when accepting connections
Allocation: Matchmaking systems request GameServer allocation via the Kubernetes API
Session Management: Game servers notify Agones when sessions end, triggering cleanup
Autoscaling: FleetAutoscalers monitor Fleet status and adjust replicas to maintain desired buffer or respond to custom policies
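
The flow above can be sketched as a small state machine. This models only the path described in these steps, using a subset of GameServer state names; a real GameServer has more states and transitions than shown here, so the transition table is an assumption made for illustration.

```python
from enum import Enum


class State(Enum):
    SCHEDULED = "Scheduled"
    READY = "Ready"
    ALLOCATED = "Allocated"
    SHUTDOWN = "Shutdown"


# Simplified transitions covering only the flow described in the steps above.
ALLOWED = {
    State.SCHEDULED: {State.READY, State.SHUTDOWN},
    State.READY: {State.ALLOCATED, State.SHUTDOWN},
    State.ALLOCATED: {State.SHUTDOWN},
    State.SHUTDOWN: set(),
}


def advance(current: State, target: State) -> State:
    """Return the new state, or raise if this sketch does not model the transition."""
    if target not in ALLOWED[current]:
        raise ValueError(f"{current.value} -> {target.value} is not modeled")
    return target


state = State.SCHEDULED
for step in (State.READY, State.ALLOCATED, State.SHUTDOWN):
    state = advance(state, step)
    print(state.value)
```

The point of the sketch is that allocation only ever picks servers that have already marked themselves Ready, and a finished session ends in Shutdown and cleanup.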

Use Cases and Production Adoption

Agones is designed for multiplayer gaming scenarios requiring dedicated game servers:

Session-based multiplayer games: FPS, MOBA, battle royale games where each match runs on a dedicated server
Persistent game worlds: MMO game zones or shards that require long-lived server processes
Match-based esports: Competitive gaming infrastructure requiring consistent server performance
Cross-platform gaming: Unified infrastructure for console, PC, and mobile multiplayer experiences

The project is already used in production by major gaming companies and has proven its reliability at scale. The CNCF sandbox application notes that “this project is already used in production by many” organizations.

Why CNCF?

According to the CNCF Sandbox application:

Since Agones is tightly coupled to Kubernetes, CNCF is the logical home for the project. Agones being in the CNCF allows for a broader community contributor ecosystem.

Agones brings a new gaming offering to the CNCF landscape, representing a specific but important use case for Kubernetes. As cloud-native technologies expand into specialized domains, gaming infrastructure represents a significant workload category with unique requirements.

Cloud-Native Integration

Agones integrates directly with core CNCF projects:

Kubernetes: Built as a Kubernetes controller with CRDs
Prometheus: Metrics exports for monitoring game server health and performance
Helm: Installation and configuration via Helm charts
Container runtimes: Works with any Kubernetes-compatible container runtime

Project Governance and Community

Agones operates as a vendor-neutral open-source project:

License: Apache 2.0
Code of Conduct: Contributor Covenant
Governance: Clear contribution guidelines and ownership model
Community Channels: Active Slack workspace, mailing list, regular community meetings
Maintained by: Originally created by Google Cloud, now community-driven with multiple maintainers

The project has comprehensive documentation, quickstart guides, and example implementations for developers getting started with game server hosting on Kubernetes.

Similar Projects and Ecosystem

Within the Kubernetes gaming ecosystem, OpenKruise’s kruise-game (github.com/openkruise/kruise-game) provides similar capabilities. Both projects demonstrate growing interest in gaming workloads on Kubernetes.

Agones’ application to CNCF Sandbox represents an opportunity to establish standards and best practices for game server orchestration across the cloud-native community.

Vision and Roadmap

Agones continues active development with regular releases following a documented release process. The project roadmap focuses on:

Enhancing autoscaling capabilities with more sophisticated policies
Improving observability and debugging tools for game server operations
Expanding SDK support for additional programming languages and engines
Performance optimizations for larger-scale deployments
Better integration with matchmaking and lobby systems

The project aims to make dedicated game server hosting as straightforward and reliable as deploying stateless web applications, while respecting the unique requirements of real-time gaming workloads.

Getting Started

For developers interested in exploring Agones:

Documentation: Comprehensive guides at agones.dev/site/docs/
Quick Start: Install Agones on a Kubernetes cluster and deploy a simple game server
Examples: Multiple example game server implementations in the repository
Community: Join the Agones Slack and mailing list for support and discussion

Agones represents the maturation of gaming infrastructure into the cloud-native era, bringing the operational benefits of Kubernetes to one of the most demanding real-time workload types.

Conclusion

Agones transforms Kubernetes into a powerful platform for dedicated game server hosting, addressing the unique challenges of multiplayer gaming infrastructure. As it applies to join the CNCF Sandbox, the project demonstrates how cloud-native technologies can adapt to specialized workload requirements while maintaining Kubernetes-native principles.

For gaming companies building multiplayer experiences and infrastructure teams managing game servers, Agones provides a proven, production-ready solution that leverages the full ecosystem of cloud-native tools and practices.

References:

Agones GitHub: github.com/googleforgames/agones
Official Website: agones.dev/site/
CNCF Sandbox Application: github.com/cncf/sandbox/issues/440
Announcement Blog: cloud.google.com/blog/products/containers-kubernetes/introducing-agones-open-source-multiplayer-dedicated-game-server-hosting-built-on-kubernetes

http://pacoxu.wordpress.com/?p=1499

How to choose the inference orchestration solution? AIBrix or Kthena or Dynamo?

22号_马修 Dec 3, 2025 Updated Mar 3, 2026

Note: The content in this article is based on currently available public information and is intended for technical reference only. The effectiveness of each solution depends heavily on your specific workload, infrastructure, and ecosystem integration. The architectural affiliations and early design choices mentioned here do not determine their future direction. In practice, community activity, openness, and long-term evolution are often more important factors. Please evaluate and choose based on your own scenario.

Introduction

The landscape of open-source inference orchestration for Large Language Models (LLMs) has evolved rapidly in 2025. Multiple projects have emerged to address the challenges of deploying and scaling LLM inference workloads on Kubernetes, each with its own approach to workload management, resource orchestration, and performance optimization.

This blog post provides an overview of the current inference orchestration solutions, examines the convergence trends in the ecosystem, and raises important questions about when Prefill-Decode (PD) disaggregation truly provides value.

The Current Landscape

Rapid Development, Gradual Convergence

The inference orchestration space is characterized by:

Many implementations: Multiple projects solving similar problems
Different architectural choices: Varying approaches to workload management
Shared goals: All aim to optimize LLM inference at scale
Emerging patterns: Common solutions beginning to emerge

Despite the diversity, we’re seeing convergence around key patterns: LeaderWorkerSet (LWS)-based architectures, intelligent routing, and disaggregated serving models.

Workload Orchestration Solutions

1. Dual LWS Architecture

llm-d implements a dual LeaderWorkerSet architecture for Prefill-Decode disaggregation:

Two LWS instances: Separate LWS for prefill and decode workers
KServe integration: Deep integration with KServe for model serving
LMCache support: Efficient KV cache management across workers
Routing sidecar: Intelligent request routing and cache optimization

Client → Routing Sidecar → Prefill LWS → KV Cache → Decode LWS → Response

Why dual LWS? This architecture enables independent scaling and resource optimization for each phase while maintaining coordination through the leader-worker pattern.

2. Serving Group: Volcano Kthena

Kthena takes a different approach with its Serving Group concept:

No dual LWS: Kthena intentionally avoids the dual LWS pattern
Gang scheduling integration: Leverages Volcano’s gang scheduling capabilities
Reduced layering: Eliminates the StatefulSet/Pod layer complexity
Direct integration: Native integration with Volcano scheduler

Why not LWS? The Kthena team found that integrating with Volcano’s gang scheduling required a different architecture. The dual LWS, StatefulSet, and Pod layering added complexity without clear benefits for their use case.

This design choice reflects a key insight: the best orchestration solution depends on your existing infrastructure and scheduling requirements.

3. StormService: AIBrix

AIBrix StormService provides specialized container lifecycle management for P/D disaggregation:

P/D lifecycle management: Fine-grained control over prefill and decode containers
Multi-mode support: TP, PP, single GPU, and P/D disaggregation
StormService and RoleSet CRDs: Custom resources for P/D orchestration
Enterprise features: Multi-tenancy, routing, and observability

Architecture:

AIBrix Control Plane
    ├── StormService Controller
    │   ├── RoleSet (Prefill)
    │   └── RoleSet (Decode)
    ├── Gateway & Routing
    └── Autoscaler

4. NVIDIA Dynamo: Two Modes

Dynamo offers two distinct deployment modes:

Grove Mode: https://github.com/ai-dynamo/dynamo/blob/be67f67b1a8d0837291ac7033af6edbc146f6995/docs/kubernetes/grove.md

High-performance inference
NVIDIA-native deployment
Optimized for pure NVIDIA infrastructure
Note (quoted): “GPU support depends on the engine: Dynamo uses backends vllm, sglang and trt-llm. Dynamo is the layer above that.”

LWS Mode:

Kubernetes-native deployment using LeaderWorkerSet
Multi-node disaggregated serving
Integration with Kubernetes ecosystem

This dual-mode approach allows users to choose the right level of abstraction for their infrastructure.

5. SGLang RBG: LWS-Inspired

RBG (RoleBasedGroup) learned from and reuses design patterns from LWS:

LWS-inspired: Incorporates proven patterns from LeaderWorkerSet
Resource-aware scheduling: Optimizes batch scheduling based on resources
Batch optimization: Intelligent batching strategies for throughput
P/D support: Enables disaggregated prefill and decode workloads

Convergence Trends

Common Patterns Emerging

Despite different implementations, several patterns are converging:

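| Pattern | llm-d | Kthena | AIBrix | Dynamo | RBG |
| --- | --- | --- | --- | --- | --- |
| LWS-based | ✓ (dual) | ✗ | ✗ | ✓ (option) | ✓ (inspired) |
| P/D disaggregation | ✓ | ✓ | ✓ | ✓ | ✓ |
| Intelligent routing | ✓ | ✓ | ✓ | ✓ | ✓ |
| KV cache management | LMCache | Native | Distributed | Native | Native |

Why So Many Implementations?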

The diversity reflects different optimization goals:

Scheduling integration: Kthena needs Volcano gang scheduling directly
Enterprise features: AIBrix focuses on multi-tenancy and observability
Performance focus: Dynamo optimizes for NVIDIA hardware
Simplicity: RBG provides a lightweight LWS-inspired approach
Production-readiness: llm-d demonstrates a complete reference implementation

The PD Disaggregation Question

At KCD Hangzhou 2025, Wen Yuan Yu’s keynote “Kubernetes Is Born for Service Resource Orchestration—MaaS Changes Everything” raised an important question about PD-separation:

“Achieving strong production gains from PD-separation is very difficult.
While stress testing can show great results, in real dynamic environments it becomes much harder.
Over-provisioning Decode introduces significant challenges.”

This observation directly challenges the assumption that PD-separation is always beneficial.

Does PD Disaggregation Always Provide Value?

The same keynote summarized its position on role scheduling as:

“PD-Disaggregate Role Scheduling • Not So Sure? (Our answer is Data Plane!)”

In other words, whether disaggregation pays off is an open question for each deployment, not a given.

When PD Disaggregation Helps

PD disaggregation provides clear benefits when:

Long prefill, short decode: Input prompts are much longer than outputs
High concurrency: Many simultaneous requests need serving
Heterogeneous hardware: Different GPU types for different phases
SLA-driven scheduling: Different latency requirements (TTFT vs TPOT)

When PD Disaggregation May Not Help

Consider alternatives when:

Short contexts: Both prefill and decode are fast
Low concurrency: Few simultaneous requests
Homogeneous hardware: Same GPUs for all workloads
Complexity costs: Operational overhead outweighs benefits
KV cache transfer overhead: Network latency exceeds computation savings

The Data Plane Perspective

The “Data Plane” answer suggests that the value of PD disaggregation depends on where bottlenecks actually exist. Before implementing complex orchestration:

Profile your workload: Understand where time is spent
Measure KV cache transfer costs: Network overhead matters
Consider simpler alternatives: TP/DP without disaggregation
Evaluate operational complexity: More components = more failure modes
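To make "measure KV cache transfer costs" concrete, here is a back-of-the-envelope sketch. The model dimensions and link bandwidth are hypothetical placeholders, not measurements of any real deployment, and the transfer time is idealized (no latency, protocol overhead, or contention).

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical model dimensions; substitute your model's real values."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # fp16/bf16


def kv_cache_bytes(shape: ModelShape, prompt_tokens: int) -> int:
    """Size of the KV cache that prefill produces for one request.

    The factor 2 accounts for storing both keys and values.
    """
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_element)
    return per_token * prompt_tokens


def transfer_seconds(num_bytes: int, bandwidth_gbit_per_s: float) -> float:
    """Idealized transfer time over a link of the given bandwidth."""
    bytes_per_second = bandwidth_gbit_per_s * 1e9 / 8
    return num_bytes / bytes_per_second


# Example with made-up numbers: 64 layers, 8 KV heads, head_dim 128, 4000-token prompt.
shape = ModelShape(num_layers=64, num_kv_heads=8, head_dim=128)
size = kv_cache_bytes(shape, prompt_tokens=4000)
print(f"KV cache: {size / 1e9:.2f} GB, "
      f"transfer at 100 Gbit/s: {transfer_seconds(size, 100) * 1000:.1f} ms")
```

If the estimated transfer time is comparable to the prefill time you hope to offload, disaggregation is unlikely to help without a faster interconnect.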

Configuration Optimization: AIConfigurator

Choosing the right P/D configuration is complex. NVIDIA’s AIConfigurator helps optimize disaggregated deployment configurations:

What AIConfigurator Does

Configuration space search: Evaluates thousands of P/D combinations
SLA-constrained optimization: Finds configurations meeting TTFT/TPOT targets
Hardware-specific tuning: Supports H100, H200, B200 with collected data
xPyD planning: Determines optimal prefill/decode worker ratios
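The idea behind xPyD planning can be illustrated with a toy brute-force search. This is not AIConfigurator's algorithm or data: the per-worker throughput numbers and function names below are hypothetical, and the capacity model (the slower stage limits sustained throughput) is a deliberate simplification that ignores SLA targets.

```python
def candidate_splits(total_gpus, gpus_per_prefill, gpus_per_decode):
    """Enumerate (x, y) worker counts that fit the GPU budget.

    For each prefill worker count x, decode gets every GPU that is left.
    """
    splits = []
    max_prefill = total_gpus // gpus_per_prefill
    for x in range(1, max_prefill + 1):
        y = (total_gpus - x * gpus_per_prefill) // gpus_per_decode
        if y >= 1:
            splits.append((x, y))
    return splits


def best_split(total_gpus, gpus_per_prefill, gpus_per_decode,
               prefill_rps_per_worker, decode_rps_per_worker):
    """Pick the split whose slower stage sustains the most requests per second."""
    def sustained_rps(split):
        x, y = split
        return min(x * prefill_rps_per_worker, y * decode_rps_per_worker)

    splits = candidate_splits(total_gpus, gpus_per_prefill, gpus_per_decode)
    if not splits:
        return None  # budget too small for one prefill plus one decode worker
    return max(splits, key=sustained_rps)


# Hypothetical inputs: 32 GPUs, 4 GPUs per worker, prefill 3 req/s, decode 1 req/s.
print(best_split(32, 4, 4, prefill_rps_per_worker=3.0, decode_rps_per_worker=1.0))
```

A real planner replaces the two constants with measured or modeled performance data per hardware type, which is the hard part.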

Example Usage

# Find optimal configuration for Qwen3-32B on 32 H200 GPUs
# with SLA targets: TTFT ≤ 300ms, TPOT ≤ 10ms
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 32 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 300 \
  --tpot 10

Why AIConfigurator Matters

Traditional autoscaling (HPA/KPA) doesn’t understand LLM-specific characteristics. AIConfigurator provides:

Informed decisions: Data-driven configuration choices
Predictive optimization: Estimate performance before deployment
Resource efficiency: Maximize GPU utilization with SLA guarantees

Recommendations

For New Deployments

Start simple: Begin with monolithic serving (no P/D disaggregation)
Profile first: Understand your workload characteristics
Use AIConfigurator: Let data guide configuration decisions
Add complexity gradually: Introduce P/D only when benefits are clear

For Existing Infrastructure

| If you use… | Consider… |
| --- | --- |
| Volcano | Kthena (native integration) |
| KServe | llm-d (deep integration) |
| vLLM | AIBrix (vLLM ecosystem) |
| NVIDIA GPUs | Dynamo (NVIDIA optimization) |
| SGLang | RBG (LWS-inspired, lightweight) |

Key Questions Before Adopting PD Disaggregation

Is your prefill time >> decode time? If not, disaggregation may not help.
Can your network handle KV cache transfer? Network overhead can eliminate gains.
Do you need independent scaling? If P and D scale together, keep them together.
Is operational complexity acceptable? More components = more failure modes.

Conclusion

The inference orchestration landscape is diverse but converging. Key takeaways:

Multiple solutions exist because different infrastructure has different needs
LWS-based patterns are popular but not universal (Kthena’s Serving Group shows alternatives)
PD disaggregation is not always valuable – profile your workload first
Tools like AIConfigurator help navigate the complex configuration space
Start simple, add complexity when needed based on actual measurements

The future will likely see further consolidation around proven patterns, but the current diversity reflects healthy experimentation in a rapidly evolving field.

References

Workload Orchestration Projects

llm-d – Dual LWS architecture for P/D
Kthena – Volcano-based Serving Group
AIBrix – StormService for P/D
Dynamo – NVIDIA inference platform
RBG – LWS-inspired batch scheduler

Configuration Tools

AIConfigurator – P/D configuration optimizer

Related Blog Posts

http://pacoxu.wordpress.com/?p=1469


Smarter Scheduling for AI Workloads: Topology-Aware Scheduling

22号_马修 Nov 28, 2025 Updated Mar 3, 2026


Why Topology? Why Now?

At KubeCon NA 2025, one theme dominated conversations in the AI/ML space:
topology. Everyone is talking about topology-aware scheduling because it’s
critical for optimizing AI workload performance.

Source: Lightning Talk: Mind the Topology – Roman Baron, NVIDIA

Modern AI workloads, especially distributed training and high-performance
inference, are extremely sensitive to hardware topology. When GPUs, NICs, CPUs,
and memory are not properly aligned within the same NUMA node, PCIe root, or
network fabric, performance can degrade by 30-50% or more.

Background: Current Topology Scheduling Support

Device Plugin: The Traditional Approach

Kubernetes Device Plugins have been the standard mechanism for managing
hardware resources like GPUs. The Device Plugin API provides:

Source: KubeCon NA 2025: Device Management

Key Components:

GetDevicePluginOptions: Plugin configuration
ListAndWatch: Report available devices to kubelet
GetPreferredAllocation: Suggest optimal device allocation (topology hint)
Allocate: Perform device allocation for containers
PreStartContainer: Pre-container-start hooks

Device Plugin supports:

Basic GPU counting (e.g., nvidia.com/gpu: 8)
MIG (Multi-Instance GPU) partitioning
Time-slicing for GPU oversubscription

Limitations of Device Plugin

However, Device Plugins have significant limitations for topology-aware
scheduling:

Source: KubeCon NA 2025: Device Management

Static isolation config: MIG configurations must be pre-defined
Static slicing config: Time-slicing ratios are fixed at deployment
Only even sharing expected: Limited sharing granularity
Requires secondary scheduler: Complex topologies need additional
schedulers like Volcano or Kueue

Kueue: Topology-Aware Scheduling

Kueue provides topology-aware
scheduling through node labels. It uses hierarchical topology levels like:

# Node labels for rack/block topology
cloud.google.com/gce-topology-block: "block-1"
cloud.google.com/gce-topology-subblock: "subblock-1"
cloud.google.com/gce-topology-host: "host-1"
kubernetes.io/hostname: "node-1"

Kueue supports:

TopologyAwareScheduling: Place workload pods on nodes with matching
topology
Cohort-based resource sharing: Share resources within topology groups
Gang scheduling with topology: Ensure all gang members are
topology-aligned

Kueue Topology Configuration Example:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-topology
spec:
  nodeLabels:
    cloud.google.com/gce-topology-block: "block-1"
  nodeTaints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: "present"

Volcano: Gang Scheduling with Topology

Volcano provides advanced scheduling
features including:

Gang scheduling: All-or-nothing scheduling for distributed workloads
Topology plugin: Consider GPU topology in scheduling decisions
Network-aware scheduling: RDMA/InfiniBand fabric awareness

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8
  minResources:
    nvidia.com/gpu: "8"
  queue: training-queue
  # Network topology constraint; assumes a Volcano release with
  # network topology aware scheduling ("soft" means best-effort)
  networkTopology:
    mode: soft

DRA: The Next Generation of Topology Management

Dynamic Resource Allocation (DRA)
represents a fundamental shift in how Kubernetes handles device topology. DRA
provides structured parameters that enable rich topology expression and
constraint specification.

How DRA Handles Topology-Aware Scheduling

DRA uses attributes and constraints with CEL (Common Expression
Language) to express topology requirements. The key mechanisms include:

Device Attributes: Each device publishes topology information

pcieRoot: PCIe hierarchy identifier
numaNode: NUMA node association
nvlinkDomain: NVLink fabric identifier
rdmaDevice: Associated RDMA NIC

Constraints: CEL expressions that enforce topology rules

Same PCIe root for GPU and NIC
Same NUMA node for CPU and memory
NVLink connectivity between GPUs

SharedID: Devices on the same topology domain get a shared identifier

GPU + NIC Topology Coordination

The most powerful use case for DRA topology is coordinating GPU and NIC
allocation on the same PCIe root. This is critical for RDMA-based distributed
training where GPU-Direct is used.

ResourceClaimTemplate with PCIe Topology Constraint Example:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-nic-topology
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: nvidia-gpu
        count: 1
      - name: rdma-nic
        deviceClassName: rdma-nic
        count: 1
      constraints:
      # GPU and NIC must be on the same PCIe root.
      # matchAttribute takes a fully qualified attribute name.
      - requests: ["gpu", "rdma-nic"]
        matchAttribute: resource.kubernetes.io/pcieRoot

How this works:

The DRA scheduler evaluates available GPUs and NICs
For each candidate GPU, it finds NICs on the same PCIe root
Only allocations satisfying the constraint are considered
The matchAttribute constraint on resource.kubernetes.io/pcieRoot ensures both
devices share the same PCIe root
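The matching logic in these steps can be sketched in a few lines. This is an illustration of the constraint semantics only, not the scheduler's code; the device records and the short attribute key `pcieRoot` (as used in the attribute list earlier in this post) are hypothetical.

```python
from itertools import product

PCIE_ROOT = "pcieRoot"  # short name used for illustration in this post


def match_attribute(candidates_by_request, attribute):
    """Return every assignment of one device per request where all chosen
    devices publish the same value for `attribute`.

    candidates_by_request maps a request name to a list of device dicts
    of the form {"name": ..., "attributes": {...}}.
    """
    request_names = list(candidates_by_request)
    pools = [candidates_by_request[name] for name in request_names]
    matches = []
    for combo in product(*pools):
        values = {device["attributes"].get(attribute) for device in combo}
        # Exactly one distinct, non-missing value means the constraint holds.
        if len(values) == 1 and None not in values:
            matches.append(dict(zip(request_names, (d["name"] for d in combo))))
    return matches


candidates = {
    "gpu": [
        {"name": "gpu-0", "attributes": {PCIE_ROOT: "pci0000:00"}},
        {"name": "gpu-1", "attributes": {PCIE_ROOT: "pci0000:80"}},
    ],
    "rdma-nic": [
        {"name": "nic-0", "attributes": {PCIE_ROOT: "pci0000:80"}},
    ],
}
print(match_attribute(candidates, PCIE_ROOT))
```

Only `gpu-1` shares a PCIe root with `nic-0` in this made-up inventory, so it is the only GPU that can satisfy the claim.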

DRANET: Network Device DRA

DRANET is Google’s DRA implementation for
network devices. It integrates with Kueue’s topology-aware scheduling using
node labels:

# DRANET uses these labels for topology awareness
cloud.google.com/gce-topology-block
cloud.google.com/gce-topology-subblock
cloud.google.com/gce-topology-host
kubernetes.io/hostname

DRANET + NVIDIA GPU DRA can coordinate:

RDMA NICs allocated with GPUs on same PCIe fabric
Multi-NIC configurations for distributed training
Network isolation using SR-IOV VFs

CPU Micro-Topology Support

The dra-driver-cpu
project is adding CPU micro-topology support including:

NUMA-aware CPU allocation
CPU pinning with topology alignment
Coordination with GPU NUMA placement

DRAConsumableCapacity: New in Kubernetes 1.34

A major advancement in DRA is the DRAConsumableCapacity feature:

Source: KubeCon NA 2025: Device Management

Key Capabilities:

Alpha feature introduced in Kubernetes 1.34
Recommended to start using from Kubernetes 1.35 (still in Alpha)

Core abilities:

Allow multiple allocations over multiple resource requests
Consumable capacity: Guaranteed resource sharing

Potential use cases:

Virtual GPU Memory Partitioning
Virtual NIC (vNIC) Sharing
Bandwidth-limited Network Allocation
I/O Bandwidth Smart Storage Device Sharing
Native Resource Request (CPU)

This enables much more flexible resource sharing while maintaining topology
awareness.
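The core idea of consumable capacity, several allocations drawing from one device's budget, can be modeled with simple bookkeeping. The class below is a conceptual sketch, not the Kubernetes API: `SharedDevice`, the capacity name `bandwidth_gbps`, and the claim names are all hypothetical.

```python
class SharedDevice:
    """Tracks how much of each capacity dimension is consumed by claims."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = dict(capacity)  # e.g. {"bandwidth_gbps": 100}
        self.allocations = {}           # claim name -> requested amounts

    def remaining(self, resource):
        used = sum(req.get(resource, 0) for req in self.allocations.values())
        return self.capacity.get(resource, 0) - used

    def allocate(self, claim, request):
        """Admit the claim only if every requested amount still fits."""
        if claim in self.allocations:
            raise ValueError(f"claim {claim!r} already holds an allocation")
        if any(self.remaining(res) < amount for res, amount in request.items()):
            return False
        self.allocations[claim] = dict(request)
        return True

    def release(self, claim):
        self.allocations.pop(claim, None)


nic = SharedDevice("nic-0", {"bandwidth_gbps": 100})
print(nic.allocate("claim-a", {"bandwidth_gbps": 60}))  # fits
print(nic.allocate("claim-b", {"bandwidth_gbps": 40}))  # fits, device now full
print(nic.allocate("claim-c", {"bandwidth_gbps": 1}))   # rejected
```

The guarantee comes from rejecting a claim instead of oversubscribing, which is what distinguishes this from time-slicing.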

Challenges: Device Plugin to DRA Migration

Many organizations have invested heavily in Device Plugin-based solutions.
Migrating to DRA presents several challenges:

1. Existing Device Plugin Investments

Organizations may have:

Custom Device Plugins with topology logic
Integration with monitoring and observability tools
Operator workflows depending on Device Plugin APIs

2. Coexistence Problems

Running Device Plugin and DRA together can cause:

Resource conflicts: Same device managed by both systems
Topology inconsistency: Different topology views between systems
Scheduling confusion: Scheduler doesn’t have unified view

3. Feature Gaps

Some Device Plugin features don’t have DRA equivalents yet:

Device health monitoring: Device Plugin has built-in health checks
Hot-plug support: Device Plugin supports dynamic device addition
Metrics integration: Prometheus metrics from Device Plugins

Solutions and Workarounds

DRA Extension Capabilities:

DRA drivers can implement compatibility layers
NVIDIA’s DRA driver supports a Device Plugin migration path
NRI integration can bridge runtime-level gaps

Recommended Migration Path:

Deploy DRA driver alongside existing Device Plugin
Use node taints to partition workloads
Gradually migrate workloads to DRA-based resource claims
Phase out Device Plugin once all workloads are migrated

Related KubeCon Talks

Several excellent talks from KubeCon NA 2025 cover these topics:

Lightning Talk: Mind the Topology

Mind the Topology: Smarter Scheduling for AI Workloads on Kubernetes
by Roman Baron, NVIDIA

Key topics:

Why topology matters for AI workloads
NVIDIA KAI Scheduler for topology-aware scheduling
NVIDIA KAI-Scheduler

Device Management Deep Dive

Deep dive into DRA and Device Plugin

Key topics:

Evolution from Device Plugin to DRA
DRAConsumableCapacity feature
Multi-device topology coordination

Best Practices for Topology-Aware Scheduling

Understand your topology requirements

Profile workloads to identify topology sensitivity
Map hardware topology (PCIe, NUMA, NVLink, RDMA)

Choose the right scheduling approach

Simple GPU workloads: Device Plugin + Topology Manager
Complex multi-device: DRA with constraints
Distributed training: Kueue or Volcano + DRA

Label nodes with topology information

Use consistent labeling scheme
Include rack, block, and host-level topology

Test topology impact

Benchmark with and without topology alignment
Measure latency and throughput differences

Plan for migration

Start with new workloads on DRA
Create compatibility tests
Document topology requirements

Conclusion

Topology-aware scheduling has evolved from a nice-to-have feature to a critical
requirement for AI workloads. The transition from Device Plugin to DRA
represents a fundamental shift in how Kubernetes manages hardware topology:

Device Plugin: Simple, established, but limited topology support
DRA: Rich topology expression, multi-device coordination, future of
Kubernetes device management

As AI workloads continue to grow in complexity, the need for sophisticated
topology-aware scheduling will only increase. Whether you’re using Kueue,
Volcano, or native Kubernetes scheduling, understanding topology and planning
for DRA adoption is essential for optimizing your AI infrastructure.

Resources

Projects

Documentation

Videos

http://pacoxu.wordpress.com/?p=1465


Kubernetes Introduces Native Gang Scheduling Support to Better Serve AI/ML Workloads

22号_马修 Nov 26, 2025 Updated Mar 3, 2026


Chinese version: https://mp.weixin.qq.com/s/EO0yfdVQMNgKI7nqkJ18Yw (“Kubernetes Supports Native Gang Scheduling: Adapting to AI/ML Workloads”)
Introduction

Scheduling large workloads in Kubernetes has always been challenging. When you need to run distributed training jobs, batch processing tasks, or other multi-pod applications, the traditional pod-by-pod scheduling approach can lead to resource wastage, deadlocks, and inefficiencies. Today, we’re excited to share insights about the Workload Aware Scheduling initiative that’s transforming how Kubernetes handles multi-pod workloads.

The Problem with Traditional Pod Scheduling

In traditional Kubernetes scheduling, each pod is scheduled independently. That is a poor fit for distributed workloads like:

Distributed ML training (e.g., PyTorch, TensorFlow multi-worker jobs)
Batch processing (e.g., Apache Spark, Ray clusters)
High-performance computing (e.g., MPI applications)

For these workloads, independent scheduling creates several problems:

Partial scheduling deadlocks: Some pods get scheduled while others wait indefinitely for resources
Resource wastage: Scheduled pods consume resources but can’t start work until all peers are ready
Poor cluster utilization: Resources are tied up by incomplete workloads
Unpredictable job completion times: Jobs may wait hours or days in partially-scheduled states

Kubernetes v1.35: Workload Aware Scheduling

The Kubernetes community has introduced Workload Aware Scheduling in v1.35, featuring three major components:

1. Workload API (Alpha)

The new Workload API resource in scheduling.k8s.io/v1alpha1 provides a structured way to define scheduling requirements for multi-pod applications.

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: ml-workloads
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # All-or-nothing: schedule only if 4 pods can run together
        minCount: 4

Link your pods to the workload:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: ml-workloads
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  containers:
  - name: trainer
    image: my-ml-framework:latest
    resources:
      requests:
        nvidia.com/gpu: 1

2. Gang Scheduling (Alpha)

Gang scheduling implements the all-or-nothing placement strategy:

How it works:

Waiting Phase: When pods arrive, the scheduler blocks them until minCount pods are pending
Evaluation Phase: The scheduler attempts to find suitable nodes for all pods in the gang
Decision Phase:
- Success: If all pods can be placed, they’re bound to nodes together
- Failure: If any pod can’t be placed within timeout (5 minutes), ALL pods are rejected and requeued

This prevents resource waste and ensures your distributed workload either runs completely or waits for sufficient resources.
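The all-or-nothing decision can be sketched as a trial placement that is committed only if every pod fits. This is a simplified illustration, not kube-scheduler code: it uses first-fit on a single hypothetical resource (GPUs), so it can miss placements a smarter packer would find, and it does not model the timeout.

```python
def try_schedule_gang(pods, free_gpus_by_node, min_count):
    """Place every pod or none.

    pods: list of (pod_name, gpu_request) tuples.
    free_gpus_by_node: dict of node name -> free GPUs; updated only on success.
    Returns a dict of pod name -> node name, or None if the gang cannot run.
    """
    if len(pods) < min_count:
        return None  # waiting phase: not enough gang members are pending yet

    trial = dict(free_gpus_by_node)  # evaluate against a copy
    placement = {}
    for pod_name, gpus in pods:
        node = next((n for n, free in trial.items() if free >= gpus), None)
        if node is None:
            return None  # one pod does not fit, so the whole gang is rejected
        trial[node] -= gpus
        placement[pod_name] = node

    free_gpus_by_node.update(trial)  # commit: bind all pods together
    return placement


cluster = {"node-a": 2, "node-b": 1}
workers = [(f"worker-{i}", 1) for i in range(4)]
print(try_schedule_gang(workers, cluster, min_count=4))  # None: only 3 GPUs free
print(cluster)  # unchanged, nothing was reserved
```

The important property is that a failed attempt leaves the cluster state untouched, so no GPUs sit idle waiting for missing peers.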

Key benefits:

Eliminates partial scheduling deadlocks
Improves cluster utilization by freeing resources for runnable workloads
Provides predictable behavior for distributed applications
Works seamlessly with pod preemption and autoscaling

3. Opportunistic Batching (Beta)

Opportunistic Batching is a performance optimization that speeds up scheduling of identical pods without requiring any configuration changes.

How it works:

When the scheduler processes pods with identical scheduling requirements (same resources, images, affinities, etc.), it can reuse feasibility calculations and scoring results for subsequent pods in the queue.
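A minimal sketch of the reuse idea: compute a signature from the scheduling-relevant fields and cache the feasibility result per signature. The pod and node shapes, and the choice of which fields form the signature, are hypothetical. The real scheduler also has to account for resources consumed by earlier pods in the batch, which this toy ignores.

```python
def scheduling_signature(pod):
    """Hashable key built from the fields this toy treats as scheduling-relevant."""
    return (
        pod["image"],
        tuple(sorted(pod["requests"].items())),
        tuple(sorted(pod.get("node_selector", {}).items())),
    )


def feasible_nodes_batched(pods, nodes, is_feasible):
    """Return (per-pod feasible node lists, number of feasibility checks run)."""
    cache = {}
    checks = 0
    results = []
    for pod in pods:
        signature = scheduling_signature(pod)
        if signature not in cache:
            cache[signature] = [n["name"] for n in nodes if is_feasible(pod, n)]
            checks += len(nodes)
        results.append(cache[signature])  # identical pods reuse the cached list
    return results, checks


def has_enough_gpus(pod, node):
    return node["gpus"] >= pod["requests"].get("gpu", 0)


nodes = [{"name": "node-a", "gpus": 8}, {"name": "node-b", "gpus": 0}]
trainer = {"image": "trainer:v1", "requests": {"gpu": 1}}
sidecar = {"image": "logger:v1", "requests": {}}
results, checks = feasible_nodes_batched(
    [trainer, dict(trainer), dict(trainer), sidecar], nodes, has_enough_gpus)
print(checks)  # 4 checks instead of 8 without the cache
```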

Performance impact:

Dramatically reduces scheduling latency for large homogeneous workloads
Can improve scheduling throughput by 5-10x for batch workloads
Works transparently – no user configuration needed
Enabled by default in Kubernetes v1.35 (Beta)

Current restrictions:

Disabled for pods using topology spread constraints
Disabled for pods using Dynamic Resource Allocation (DRA)
All scheduling-relevant pod fields must be identical

Real-World Use Cases

Distributed ML Training

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: pytorch-training
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 8  # Need 8 GPUs for distributed training

Your PyTorch distributed training job only starts when all 8 workers can be scheduled, preventing wasted GPU resources.

Apache Spark on Kubernetes

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: spark-job
spec:
  podGroups:
  - name: executors
    policy:
      gang:
        minCount: 10  # 1 driver + 9 executors minimum

Spark jobs with gang scheduling avoid the common problem where the driver starts but executors can’t be scheduled.

Ray Clusters

Ray applications benefit from gang scheduling by ensuring the head node and worker nodes start together, enabling immediate distributed computation.

The Roadmap: What’s Coming in 1.36 and Beyond

The Workload Aware Scheduling effort has an ambitious roadmap for Kubernetes 1.36:

Planned for v1.36

Expanding Workload API: Enhanced capabilities and refinements based on alpha feedback
Auto-workload for Job, StatefulSet, JobSet: Automatic workload creation for common Kubernetes resources
Topology Aware Scheduling: Consider network and hardware topology when placing gang members
Single-cycle workload scheduling: Schedule entire gangs in a single scheduling cycle for better performance
Tree-based workload scheduling algorithm: More efficient gang placement decisions
Improved binding process: Better handling of kubelet races using nominations
Delayed preemption: Introduce nominating victims before actual eviction
Workload-level preemption: Preempt entire gangs rather than individual pods

Long-term Vision

The ultimate goal is to make Kubernetes natively understand and optimize for workload-level operations, including:

Deep integration with cluster autoscaling
Workload-aware resource quotas and limits
Better support for mixed workload types (batch + serving)
Enhanced observability for multi-pod applications

Upcoming Official Blog Post

The Kubernetes community is preparing an official blog post about Workload Aware Scheduling that will be published soon on the Kubernetes blog. Watch for kubernetes/website#53012 to be merged for the official announcement.

Getting Started

Prerequisites

Kubernetes v1.35 or later
Feature gates configured on kube-apiserver and kube-scheduler

Enable Workload API and Gang Scheduling

# On kube-apiserver
--feature-gates=GenericWorkload=true
--runtime-config=scheduling.k8s.io/v1alpha1=true

# On kube-scheduler
--feature-gates=GenericWorkload=true,GangScheduling=true

Enable Opportunistic Batching

Opportunistic Batching is enabled by default in v1.35 as a Beta feature. To disable it:

# On kube-scheduler
--feature-gates=OpportunisticBatching=false

Testing Gang Scheduling

Create a Workload resource
Create pods with workloadRef pointing to the Workload
Observe scheduling behavior in kube-scheduler logs
Monitor metrics for gang scheduling success/failure rates

Best Practices

Set appropriate minCount: Consider your application’s minimum viable size
Use resource requests accurately: Gang scheduling depends on accurate resource requirements
Monitor scheduling metrics: Track gang scheduling success rates and timeout events
Test with cluster autoscaling: Ensure your autoscaler can provision nodes for gangs
Plan for failure scenarios: Understand timeout behavior and retry logic

Comparison with Existing Solutions

Before native gang scheduling, users relied on:

Volcano: CNCF incubating project with gang scheduling
Kueue: Kubernetes SIG project for queue and quota management
YuniKorn: Apache project with gang scheduling support
Custom schedulers: In-house solutions for specific use cases

Why use native gang scheduling?

Maintained by Kubernetes SIG Scheduling
Integrated with core scheduler features (preemption, autoscaling)
No additional components to deploy and maintain
Part of the Kubernetes conformance suite (eventually)

When to use external schedulers?

Need production-ready gang scheduling today (use Volcano or Kueue)
Require features beyond current Kubernetes roadmap
Have existing investments in specific schedulers

Resources and References

KEPs and Documentation

Related Projects

Several projects currently support gang scheduling:

Volcano Scheduler – CNCF Incubating
- Full gang scheduling support
- Recently added LeaderWorkerSet (LWS) gang scheduling in v1.13 release
Koordinator – Alibaba Open Source
- Basic gang scheduling capabilities
- Workload orchestration and resource scheduling enhancements
Kueue – Kubernetes SIG Project
- CoScheduling support (a lighter version of gang scheduling)
- Focus on job queueing and quota management
YuniKorn – Apache Project
- Gang scheduling and resource scheduling capabilities

Community

SIG Scheduling: https://github.com/kubernetes/community/tree/master/sig-scheduling
Slack: #sig-scheduling on Kubernetes Slack

Conclusion

Gang Scheduling and Workload Aware Scheduling represent a major step forward for Kubernetes in supporting AI/ML, HPC, and batch processing workloads. The v1.35 alpha release provides a foundation for native multi-pod scheduling, with an exciting roadmap for v1.36 and beyond.

We encourage the community to:

Test these features in development environments
Provide feedback through GitHub issues
Share use cases and requirements
Contribute to the ongoing development

The future of Kubernetes scheduling is workload-aware, and the journey has just begun!

http://pacoxu.wordpress.com/?p=1457


The Shift to cgroups v2 in Kubernetes: What You Need to Know

22号_马修 Oct 21, 2025 Updated Mar 3, 2026


Kubernetes v1.35 will announce the deprecation of cgroup v1: FailCgroupV1 will be set to true by default, so the kubelet will fail on cgroup v1 with the default configuration. See the upcoming blog post at https://github.com/kubernetes/website/pull/52814 for details. Below is what I wrote after cgroup v1 was announced to enter maintenance mode. Because the draft linked to a lot of material and I could not get it to a complete state, I stopped updating https://github.com/kubernetes/website/pull/47342. I am publishing it here for readers who want to know why we should move from cgroup v1 to v2 and how the two differ.

cgroups (control groups) are a Linux kernel feature used for managing system resources. Kubernetes uses cgroups to allocate resources like CPU and memory to containers, ensuring that applications run smoothly without interfering with each other. With the release of Kubernetes v1.31, cgroups v1 support moved into maintenance mode (https://kubernetes.io/blog/2024/08/14/kubernetes-1-31-moving-cgroup-v1-support-maintenance-mode/). cgroups v2 support graduated to stable in v1.25, two years earlier.

The top FAQs are: why should we migrate, what are the benefits and losses, and what should we watch out for when using cgroups v2?

cgroups v1 problem, and solutions in cgroups v2

The official cgroups v1 and cgroups v2 documentation can be found in the Linux kernel documentation.

Let’s enumerate some known issues.

active_file memory is not considered as available memory

There is a known page cache issue: #43916.

In cgroups v1, there is no native solution. Workarounds include setting a larger memory limit for Pods, or using external projects to drop the cache or throttle memory allocation once usage passes a threshold.
In cgroups v2, we can use memory.high to throttle.

Support for Memory QoS was initially added in Kubernetes v1.22, and later some limitations around the formula for calculating memory.high were identified. These limitations are addressed in Kubernetes v1.27.

However, as of v1.31 the feature gate is still alpha because of another known issue: an application pod may hang forever under heavy memory reclaim.
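For reference, the memory.high calculation can be written out as a small function. This follows the formula as I understand it from the Kubernetes v1.27 Memory QoS blog post (requests plus a throttling factor times the gap to the limit, rounded down to a page boundary, with a default factor of 0.9). Treat the function name, defaults, and page size as assumptions and check them against your Kubernetes version.

```python
import math


def memory_high(requests_bytes, limit_bytes, throttling_factor=0.9, page_size=4096):
    """Compute a memory.high value in bytes, rounded down to a page boundary.

    limit_bytes is the container memory limit, or node allocatable memory
    when the container sets no limit.
    """
    raw = requests_bytes + throttling_factor * (limit_bytes - requests_bytes)
    return math.floor(raw / page_size) * page_size


GIB = 1 << 30
# Container requesting 1 GiB with a 2 GiB limit.
print(memory_high(1 * GIB, 2 * GIB))
```

Throttling starts at about 90% of the way from the request to the limit, which gives the kernel room to reclaim before the hard limit triggers an OOM kill.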

Container aware OOM killer and better OOM handling strategies

In cgroups v2, one process of a multi-processes Pod could be killed by the OOM killer. In this case, Pod has to use runit or supervisord to manage multi processes lifecycle.

cgroups v2 uses cgroup.kill file. Writing “1” to the file causes the cgroups and all descendant cgroups to be killed. This means that all processes located in the affected cgroup tree will be killed via SIGKILL. Pod may run multiple processes, and all processes can be killed simultaneously.

As mentioned above, cgroups v2 memory.high can throttle new memory allocations, so the cgroup can become aware of an approaching OOM earlier. In addition, PSI (Pressure Stall Information) helps gauge the memory load. oomd is a good example of using PSI to implement a userspace out-of-memory killer.
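To make the PSI idea above more concrete, here is a small sketch that parses the text of a memory.pressure file (lines starting with some and full, followed by avg10, avg60, avg300, and total fields) and applies a simple threshold. The function names and the threshold of 10.0 are hypothetical choices for this example; oomd's real policy is considerably richer.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, value = field.split("=", 1)
            # total is a cumulative stall time in microseconds; the avg*
            # fields are percentages over 10s, 60s and 300s windows.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def under_memory_pressure(text, threshold=10.0):
    """Return True if the 10-second 'some' average exceeds the threshold."""
    psi = parse_psi(text)
    return psi.get("some", {}).get("avg10", 0.0) > threshold
```

On a cgroups v2 node with PSI enabled, the input would come from a file such as /sys/fs/cgroup/<cgroup path>/memory.pressure.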

Rootless support

In cgroups v1, delegating cgroups v1 controllers to less privileged containers may be dangerous.

Unlike cgroups v1, cgroups v2 officially supports delegation. Most Rootless Containers implementations rely on systemd for delegating v2 controllers to non-root users.

According to KEP-127, the minimum kernel version for user namespaces is 6.5.

What’s more?

eBPF stories:
- In cgroups v1, device access control is defined in static configuration.
- The cgroups v2 device controller has no interface files and is implemented on top of cgroup BPF.
- By default, Cilium automatically mounts the cgroups v2 filesystem required to attach BPF cgroup programs at the path /run/cilium/cgroupv2.
PSI support is planned for a future release (KEP-4205), but it is pending because of the runc 1.2.0 release delay.
Monitoring tool support, such as cAdvisor: cgroups v2 features are not fully supported yet.

Adopting cgroup version 2

Requirements

Here’s what you need to use cgroup v2 with Kubernetes. First up, you need to be using a version of Kubernetes with support for v2 cgroup management; that’s been stable since Kubernetes v1.25 and all supported Kubernetes releases include this support.

OS distribution enables cgroups v2
Linux Kernel version is 5.8 or later
Container runtime supports cgroups v2. For example:
- containerd v1.4 or later (at the time of writing, containerd releases v1.6 and later are within that project’s support period)
- CRI-O v1.20 or later
The kubelet and the container runtime are configured to use the systemd cgroup driver

kernel updates around cgroups v2

cgroups v2 first appeared in Linux Kernel 4.5 in 2016.

Linux 4.5 supported cgroups v2 management of the io, memory, and pids controllers.
Linux 4.15 added support for cgroups v2 cpu management
Pressure Stall Information (PSI) support began with Linux 4.20.
The Kubernetes project does not recommend using cgroups v2 with a Linux kernel older than 5.2 due to lack of cgroup-level task freezer support.
In Kubernetes, 5.8 is the minimum kernel version for cgroups v2, because the root cpu.stat file on cgroups v2 was only added in kernel 5.8.
memory.peak was added in 5.19.
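The kernel milestones listed above can be turned into a quick lookup. The sketch below only encodes the versions from this list; the table and function names are hypothetical, and distribution kernels with backports may behave differently from what the version number suggests.

```python
import re

# (minimum kernel version, feature) pairs taken from the list above.
CGROUP_V2_MILESTONES = [
    ((4, 5), "io, memory and pids controllers"),
    ((4, 15), "cpu controller"),
    ((4, 20), "Pressure Stall Information (PSI)"),
    ((5, 2), "cgroup-level task freezer"),
    ((5, 8), "root cpu.stat (Kubernetes minimum)"),
    ((5, 19), "memory.peak"),
]


def parse_kernel_release(release):
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def available_features(release):
    """Return the listed cgroups v2 features this kernel version includes."""
    version = parse_kernel_release(release)
    return [name for minimum, name in CGROUP_V2_MILESTONES if version >= minimum]


def meets_kubernetes_minimum(release):
    """True if the kernel is 5.8 or later."""
    return parse_kernel_release(release) >= (5, 8)
```

On a node, the release string could be obtained with platform.release() from the Python standard library.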

Use systemd as cgroup driver

Configure the kubelet’s cgroup driver to match the container runtime cgroup driver.

The Container runtimes page explains that the systemd driver is recommended for kubeadm based setups instead of the kubelet’s default cgroupfs driver, because kubeadm manages the kubelet as a systemd service.

A minimal example of configuring the field explicitly:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd

In v1.31, KEP-4033 is beta. It extends the CRI API so that the kubelet can discover the cgroup driver from the container runtime. This will help installers and the kubelet autodetect the cgroup driver.

Tools and commands for troubleshooting

Tools and commands that you should know about cgroups:

stat -fc %T /sys/fs/cgroup/: Check whether cgroups v2 is enabled; it returns cgroup2fs when it is.
systemctl list-units kube* --type=slice or --type=scope: List kube related units that systemd currently has in memory.
bpftool cgroup list /sys/fs/cgroup/*: List all programs attached to the cgroup CGROUP.
systemd-cgls /sys/fs/cgroup/*: Recursively show control group contents.
systemd-cgtop: Show top control groups by their resource usage.
tree -L 2 -d /sys/fs/cgroup/kubepods.slice: Show Pods’ related cgroups directories.
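For scripted checks, a similar test can be approximated in Python. The sketch below is a heuristic, not an equivalent of the stat command above: it assumes that a cgroup.controllers file at the root of the mount indicates a unified cgroups v2 hierarchy, and the function names are made up for this example. Hybrid setups, where v2 is mounted elsewhere, would need extra handling.

```python
from pathlib import Path


def cgroup_mode(root="/sys/fs/cgroup"):
    """Guess the cgroup mode from the layout of the cgroup mount point."""
    root = Path(root)
    if (root / "cgroup.controllers").is_file():
        return "v2"
    if root.is_dir() and any(root.iterdir()):
        # Per-controller directories (cpu, memory, ...) suggest v1 or hybrid.
        return "v1-or-hybrid"
    return "unknown"


def enabled_controllers(root="/sys/fs/cgroup"):
    """List the controllers named in the root cgroup.controllers file."""
    path = Path(root) / "cgroup.controllers"
    if not path.is_file():
        return []
    return path.read_text().split()
```

The root parameter exists mainly so the functions can be pointed at a test directory instead of the real mount.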

How to check if a Pod CPU or memory limit is successfully applied to the cgroup file?

Kubernetes Pod Spec: check limits spec.containers[*].resources.limits.{cpu,memory} and requests spec.containers[*].resources.requests.{cpu,memory}
CRI: cpu_period, cpu_quota, cpu_shares for CPU and memory_limit_in_bytes for memory limit
OCI Spec: memory.limit, cpu.shares, cpu.quota, cpu.period
Systemd Scope Unit: CPUWeight, CPUQuotaPerSecUSec, CPUQuotaPeriodUSec, MemoryMax
Cgroupfs value: /sys/fs/cgroup/../cpu.weight, /sys/fs/cgroup/../cpu.max, /sys/fs/cgroup/../memory.max
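To cross-check the layers listed above, it helps to know roughly what numbers to expect at the bottom. The sketch below converts CPU millicores into cpu.shares, cpu.weight, and a cpu.max string. It assumes the conventional 1024 shares per CPU, a 100000 microsecond period, and the linear shares-to-weight mapping that runc and the kubelet have long used; newer runtime releases may use a different mapping, so treat the results as illustrative. All function names are hypothetical.

```python
def milli_cpu_to_shares(milli_cpu):
    """CPU request in millicores -> cgroups v1 style cpu.shares."""
    shares = milli_cpu * 1024 // 1000
    return max(2, min(shares, 262144))


def shares_to_weight(shares):
    """Map cpu.shares (2..262144) onto cgroups v2 cpu.weight (1..10000)."""
    return 1 + ((shares - 2) * 9999) // 262142


def milli_cpu_to_cpu_max(milli_cpu, period_us=100000):
    """CPU limit in millicores -> the 'quota period' string in cpu.max."""
    if milli_cpu <= 0:
        # No limit set: the quota is reported as "max".
        return f"max {period_us}"
    quota = max(milli_cpu * period_us // 1000, 1000)
    return f"{quota} {period_us}"


if __name__ == "__main__":
    request_m, limit_m = 500, 1000
    print("cpu.weight:", shares_to_weight(milli_cpu_to_shares(request_m)))
    print("cpu.max:", milli_cpu_to_cpu_max(limit_m))
```

Requests drive the weight and limits drive the quota, which is why a container with a request but no limit shows a cpu.weight value together with a cpu.max that starts with max.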