Kubetorch for Fault-Tolerant ML on Kubernetes
Kubetorch is a system for fast, scalable, and fault-tolerant ML on Kubernetes. You can take any function or class and use the `.to()` API to dispatch it to run on Kubernetes; the "Torch" in the name refers to how Kubetorch aims to make Kubernetes as easy to command and debug as PyTorch made GPUs. A few problems that Kubetorch solves:
- Fast iteration. Traditionally, iterating on a distributed job means some combination of tearing down the running job, rebuilding Docker containers, requeueing for pods, and redownloading artifacts before training finally restarts (or inference is redeployed). With Kubetorch, everything is held in place when code changes, giving 1-3 second iteration loops at distributed scale.
- Fault tolerance and orchestration. Kubetorch gives imperative control over execution from a controller process outside of your pods. Errors on workloads surface in the controller, where they can be caught and handled, and the same imperative control enables complex orchestration, such as reinforcement learning, directly on Kubernetes.
- Pythonic APIs. The APIs are entirely Pythonic, so ML researchers and engineers are never exposed to YAML or platform nitty-gritty.
- Standard Kubernetes. Kubernetes is proven at scale and has a rich ecosystem of observability and management tooling. Kubetorch builds on standard Kubernetes primitives rather than rebuilding them (as Ray does), so it drops into existing platforms.
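To make the fault-tolerance bullet concrete, here is a minimal, self-contained sketch of the controller pattern described above: a driver process dispatches work to workers, catches worker failures itself, and retries, rather than the whole job dying inside a pod. A thread pool stands in for Kubernetes pods here; none of this is the actual Kubetorch API, and `train_step` and `run_with_retries` are illustrative names.

```python
import concurrent.futures

_failed_once = set()

def train_step(step: int) -> str:
    # Simulate a transient worker failure (e.g. a preempted pod) on step 2.
    if step == 2 and step not in _failed_once:
        _failed_once.add(step)
        raise RuntimeError(f"worker lost on step {step}")
    return f"step {step} ok"

def run_with_retries(steps, max_retries=2):
    """Controller loop: dispatch each step to a worker, catch errors in
    the controller process, and retry instead of letting the job die."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        for step in steps:
            for attempt in range(max_retries + 1):
                try:
                    # .result() re-raises any exception from the worker
                    # here in the controller, where we can handle it.
                    results.append(pool.submit(train_step, step).result())
                    break
                except RuntimeError as err:
                    if attempt == max_retries:
                        raise
                    print(f"caught in controller: {err}; retrying")
    return results

print(run_with_retries(range(4)))
# → ['step 0 ok', 'step 1 ok', 'step 2 ok', 'step 3 ok']
```

The key design point is where the `try`/`except` lives: because the controller sits outside the workers, a worker crash becomes an ordinary Python exception the controller can react to, which is what enables retries, rescheduling, and RL-style orchestration logic in plain Python.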