Serverless computing is an emerging paradigm for structuring applications in such a way that they can benefit from on-demand computing resources and achieve horizontal scalability. As such, it is an ideal substrate for the resource-intensive and often ad-hoc task of training deep
...
Serverless computing is an emerging paradigm for structuring applications in such a way that they can benefit from on-demand computing resources and achieve horizontal scalability. As such, it is an ideal substrate for the resource-intensive and often ad-hoc task of training deep learning models. However, the design and stateless nature of serverless platforms make it difficult to translate distributed machine learning systems directly to this new world. With KubeML, we present a purpose-built serverless machine learning system that runs atop Kubernetes and seamlessly embeds into the popular PyTorch framework. Unlike alternative systems, KubeML fully embraces GPU acceleration and is able to outperform TensorFlow, especially with smaller local batches, while allowing for higher resource density. KubeML reaches a 3.98x faster time-to-accuracy with small batch sizes, and maintains a 2.02x speedup between the top results of both platforms for commonly benchmarked machine learning models like ResNet34.