Cloud Native Aarhus

Cloud Native AI and Why's the GPUs gone?

Capacity: in-person
Event date: April 15, 2026, 04:30 PM - 08:00 PM CEST
Location: Kamstrup A/S, Industrivej 51
About this event

AGENDA

16:30: Doors open

17:00: Welcome by Cloud Native Aarhus

17:10: Welcome by our hosts Kamstrup

17:15: Stop Wasting Your GPUs - Kubernetes Features You Already Have by Lucy Sweet, Staff Engineer @ Uber

18:00: Break with food

18:30: Creating a Platform that can move where GPUs are available by Nicklas Frahm, Director of Platform Engineering @ Corti

19:15: Networking

20:00: Doors close

Talk #1: Stop Wasting Your GPUs - Kubernetes Features You Already Have

Kubernetes has shipped more GPU and AI features in the last year than in the previous five combined. Dynamic Resource Allocation lets you request GPUs by their actual properties instead of opaque integers, and share them across pods. In-Place Pod Vertical Scaling lets you resize CPU and memory without rescheduling your pod, so you keep your node, your IP, and your GPU. Container Restart Rules let you retry a container that exits with a retriable error code, like a GPU out-of-memory error, without rescheduling the entire pod to a different node.

All three are GA or beta-on-by-default. They are in your cluster today if you're running Kubernetes 1.35. You probably aren't using them yet.
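As a rough sketch of how the three features fit together in a single pod spec. This assumes a Kubernetes 1.35 cluster with the relevant feature gates enabled and a DRA driver installed; the device class name, image, and exit code 42 (standing in for a GPU out-of-memory) are illustrative, and the restart-rule field names follow the upstream KEP and may differ slightly by release:

```yaml
# Dynamic Resource Allocation: request a GPU as a typed device, not an
# opaque integer. Device class names come from the installed DRA driver.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.example.com   # illustrative driver device class
---
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  resourceClaims:
  - name: gpu
    resourceClaimName: training-gpu
  containers:
  - name: train
    image: registry.example.com/train:latest   # illustrative image
    resources:
      claims:
      - name: gpu          # attach the DRA-allocated GPU
      requests:
        cpu: "2"
        memory: 8Gi
    # In-Place Pod Vertical Scaling: allow cpu/memory resizes without a
    # container restart, so the pod keeps its node, IP, and GPU.
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: NotRequired
    # Container Restart Rules: restart this container in place when it
    # exits with a retriable code, instead of failing the whole pod.
    restartPolicy: Never
    restartPolicyRules:
    - action: Restart
      exitCodes:
        operator: In
        values: [42]       # illustrative "retriable GPU OOM" exit code
```

A resize is then applied through the pod's resize subresource, e.g. `kubectl patch pod trainer --subresource resize` with a patch that changes the container's cpu or memory requests, and the scheduler never gets involved.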

We'll walk through the features that matter most for GPU and AI workloads, show what they look like in practice, and explain what's coming next, including Resource Health Status (which will put GPU failure data directly in kubectl describe pod), the Eviction Request API, native gang scheduling, and declarative node maintenance.

Lucy Sweet is a Staff Engineer @ Uber and co-lead of the Kubernetes Node Lifecycle Working Group. Her team manages one of the largest Kubernetes deployments in the world. She has helped shape upstream features including In-Place Pod Resizing, the Eviction Request API, and Pod Level Resources.

Talk #2: Creating a Platform that can move where GPUs are available

GPUs are expensive and difficult to procure. That, combined with geopolitical tension, has motivated an internal push for a more sovereign infrastructure stack @ Corti. Most recently, the platform team at Corti has been building an open-source Kubernetes engine called Kommodity, based on Talos Linux and Cluster API, as well as a KServe-based inference stack using vLLM and NVIDIA Triton. Now they are using it to move their infrastructure from Azure to Scaleway.
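For a flavor of what a KServe-based inference stack involves, here is a minimal sketch of an InferenceService using KServe's Hugging Face runtime, which serves generative models through vLLM. The service name and model ID are illustrative, not Corti's, and the exact runtime arguments depend on the KServe version deployed:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-llm                 # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface           # KServe runtime that backs generative models with vLLM
      args:
      - --model_id=Qwen/Qwen2.5-0.5B-Instruct   # illustrative model
      resources:
        limits:
          nvidia.com/gpu: "1"       # one GPU per replica
```

Because the whole stack is declared this way, on Talos Linux clusters managed through Cluster API, the same manifests can be applied wherever GPU capacity happens to be available, which is what makes the Azure-to-Scaleway move tractable.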

Nicklas Frahm is the Director of Platform Engineering at Corti, a vertical AI infrastructure lab. Their flagship model Symphony provides a wide variety of capabilities for the medical sector via an API that enables medical-grade speech recognition, document generation, medical coding, and an agentic framework.
