Cloud Native AI and Why's the GPU Gone?
AGENDA
16:30: Doors open
17:00: Welcome by Cloud Native Aarhus
17:10: Welcome by our hosts Kamstrup
17:15: Stop Wasting Your GPUs - Kubernetes Features You Already Have by Lucy Sweet, Staff Engineer @ Uber
18:00: Break with food
18:30: Creating a Platform that can move where GPUs are available by Nicklas Frahm, Director of Platform Engineering @ Corti
19:15: Networking
20:00: Doors close
Talk #1: Stop Wasting Your GPUs - Kubernetes Features You Already Have
Kubernetes has shipped more GPU and AI features in the last year than in the previous five combined. Dynamic Resource Allocation lets you request GPUs by actual properties instead of opaque integers, and share them across pods. In-Place Pod Vertical Scaling lets you resize CPU and memory without rescheduling your pod, so you keep your node, your IP, and your GPU. Container Restart Rules let you retry a container that exits with a retriable error code, like a GPU out-of-memory, without rescheduling the entire pod to a different node.
All three are GA or beta-on-by-default. They are in your cluster today if you're running Kubernetes 1.35. You probably aren't using them yet.
We'll walk through the features that matter most for GPU and AI workloads, show what they look like in practice, and explain what's coming next, including Resource Health Status (which will put GPU failure data directly in kubectl describe pod), the Eviction Request API, native gang scheduling, and declarative node maintenance.
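To make the abstract concrete, here is a minimal sketch (not from the talk) of what two of those features look like in a pod spec: a resize policy that lets CPU and memory change in place, and restart rules that retry the container on a retriable exit code. Field names such as restartPolicyRules come from recent Kubernetes releases and should be checked against your cluster's API version; the image name and exit code 42 are made-up placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker        # hypothetical example pod
spec:
  containers:
  - name: trainer
    image: example.com/trainer:latest   # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        nvidia.com/gpu: 1
    # In-Place Pod Vertical Scaling: allow CPU/memory changes
    # without restarting the container or rescheduling the pod.
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: NotRequired
    # Container Restart Rules: retry in place only on a
    # retriable exit code, keeping the node, IP, and GPU.
    restartPolicy: Never
    restartPolicyRules:
    - action: Restart
      exitCodes:
        operator: In
        values: [42]      # assumed app-specific GPU-OOM exit code
```

Without the restart rules, a failing container would follow the pod-level restart policy and could end up rescheduled onto a different node, losing its GPU allocation.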
Lucy Sweet is a Staff Engineer @ Uber and co-lead of the Kubernetes Node Lifecycle Working Group. Her team manages one of the largest Kubernetes deployments in the world. She has helped shape upstream features including In-Place Pod Resizing, the Eviction Request API, and Pod Level Resources.
Talk #2: Creating a Platform that can move where GPUs are available
GPUs are expensive and difficult to procure. That, together with geopolitical tension, has motivated an internal push for a more sovereign infrastructure stack @ Corti. Most recently, the platform team at Corti has been building Kommodity, an open-source Kubernetes engine based on Talos Linux and Cluster API, along with a KServe-based inference stack using vLLM and NVIDIA Triton. They are now using it to move their infrastructure from Azure to Scaleway.
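For readers unfamiliar with KServe, a stack like the one described above is typically driven by an InferenceService resource. The sketch below is illustrative only, not Corti's actual configuration: the service name is invented, and the model URI is just an example of a public Hugging Face model served through KServe's vLLM-backed Hugging Face runtime.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-llm           # hypothetical name, not Corti's
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface     # KServe's Hugging Face runtime (vLLM-backed)
      # Example public model; substitute your own.
      storageUri: hf://meta-llama/Llama-3.1-8B-Instruct
      resources:
        limits:
          nvidia.com/gpu: 1   # one GPU per predictor replica
```

Because the whole inference stack is expressed as Kubernetes resources like this, moving it between clouds largely reduces to pointing the same manifests at a cluster in the new provider, which is the portability argument the talk makes.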
Nicklas Frahm is the Director of Platform Engineering at Corti, a vertical AI infrastructure lab. Their flagship model, Symphony, serves the medical sector through an API offering medical-grade speech recognition, document generation, medical coding, and an agentic framework.