Artificial Intelligence Technical Community Group

Kubetorch for Fault-tolerant ML in K8s

Attendees: 3
Event date
Jan 16, 26
08:00 AM - 09:00 AM PST
Location
Virtual event
About this event

Kubetorch is a system for fast, scalable, and fault-tolerant ML on Kubernetes. You can take any function or class and use the .to() APIs to dispatch them to run on Kubernetes; the "Torch" analogy refers to how Kubetorch makes Kubernetes as easy to command and debug as Torch did for GPUs. A few problems that Kubetorch solves:

  • Traditionally, iteration means some combination of tearing down a running job, rebuilding Docker containers, requeueing for pods, and redownloading artifacts before training restarts (or inference is redeployed). With Kubetorch, everything is held in place across code changes, enabling 1-3 second iteration loops at distributed scale.

  • Kubetorch gives imperative control over execution from a controller process outside your pods. This provides fault tolerance (the controller catches errors on workloads) while enabling complex orchestration, such as reinforcement learning, directly on Kubernetes.

  • The APIs are entirely Pythonic, avoiding exposing ML researchers and engineers to YAML and platform nitty-gritty.

  • Kubernetes is proven to work at scale and has a rich ecosystem of observability and management tooling. Kubetorch builds on purely standard Kubernetes primitives rather than rebuilding them (as Ray does), so it drops into existing platforms.
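To make the dispatch idea concrete, here is a toy sketch in plain Python of what "take any function and send it to compute" means. This is not Kubetorch's actual API; the `to` helper, the `Remote` proxy, and the `"gpu-pool"` target are all hypothetical stand-ins to illustrate the pattern of wrapping a local function so it is invoked through a controller that can catch failures.

```python
# Toy sketch of the ".to()" dispatch pattern described above.
# NOT Kubetorch's real API: Remote, to(), and "gpu-pool" are hypothetical.

class Remote:
    """Hypothetical proxy for a function dispatched to remote compute."""

    def __init__(self, fn, target):
        self.fn = fn
        self.target = target  # in a real system, a Kubernetes compute spec

    def __call__(self, *args, **kwargs):
        # A real system would run this in a pod; here we run locally and
        # surface failures to the caller, mimicking controller-side
        # error handling (the fault-tolerance point from the bullets).
        try:
            return self.fn(*args, **kwargs)
        except Exception as exc:
            raise RuntimeError(f"remote call on {self.target} failed") from exc


def to(fn, target):
    """Dispatch `fn` to `target`, returning a callable proxy."""
    return Remote(fn, target)


def train(epochs):
    return f"trained for {epochs} epochs"


# The caller keeps an ordinary Python calling convention -- no YAML involved.
remote_train = to(train, target="gpu-pool")
print(remote_train(3))  # prints "trained for 3 epochs"
```

The design point this illustrates: the caller interacts with a plain Python callable, while the proxy layer decides where the work runs and how errors propagate back, which is what keeps the researcher-facing surface free of platform details.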

Organizers