Kubetorch for Fault-Tolerant ML on Kubernetes
Kubetorch is a system for fast, scalable, and fault-tolerant ML on Kubernetes. You can take any function or class and use the `.to()` API to dispatch it to run on Kubernetes; the "Torch" in the name refers to how Kubetorch aims to make Kubernetes as easy to command and debug as PyTorch made GPUs. A few problems that Kubetorch solves:
- Fast iteration. Traditionally, iterating on a distributed job means some combination of tearing down the running job, rebuilding Docker containers, requeueing for pods, and redownloading artifacts before training finally restarts (or inference is redeployed). With Kubetorch, everything is held in place when code changes, giving 1-3 second iteration loops at distributed scale.
- Fault tolerance and orchestration. Kubetorch gives imperative control over execution from a controller process outside of your pods. Errors on workloads surface in the controller, where they can be caught and handled, and the same imperative control enables complex orchestration, such as reinforcement learning, directly on Kubernetes.
- Pythonic APIs. The APIs are entirely Pythonic, so ML researchers and engineers are never exposed to YAML or platform nitty-gritty.
- Standard Kubernetes. Kubernetes is proven at scale and has a rich ecosystem of observability and management tooling. Kubetorch builds on standard Kubernetes primitives rather than rebuilding them (as Ray does), so it drops into existing platforms.
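To make the fault-tolerance bullet concrete, here is a minimal, self-contained sketch of the controller pattern described above: a driver process dispatches work to workers, catches worker failures itself, and retries, rather than the whole job dying inside a pod. A thread pool stands in for Kubernetes pods here; none of this is the actual Kubetorch API, and `train_step` and `run_with_retries` are illustrative names.

```python
import concurrent.futures

_failed_once = set()

def train_step(step: int) -> str:
    # Simulate a transient worker failure (e.g. a preempted pod) on step 2.
    if step == 2 and step not in _failed_once:
        _failed_once.add(step)
        raise RuntimeError(f"worker lost on step {step}")
    return f"step {step} ok"

def run_with_retries(steps, max_retries=2):
    """Controller loop: dispatch each step to a worker, catch errors in
    the controller process, and retry instead of letting the job die."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        for step in steps:
            for attempt in range(max_retries + 1):
                try:
                    # .result() re-raises any exception from the worker
                    # here in the controller, where we can handle it.
                    results.append(pool.submit(train_step, step).result())
                    break
                except RuntimeError as err:
                    if attempt == max_retries:
                        raise
                    print(f"caught in controller: {err}; retrying")
    return results

print(run_with_retries(range(4)))
# → ['step 0 ok', 'step 1 ok', 'step 2 ok', 'step 3 ok']
```

The key design point is where the `try`/`except` lives: because the controller sits outside the workers, a worker crash becomes an ordinary Python exception the controller can react to, which is what enables retries, rescheduling, and RL-style orchestration logic in plain Python.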