publications | Andrew M. Zhang

2019

SoCC

Cirrus: A Practical Serverless Framework for End-to-End ML Workflows

J. Carreira, P. Fonseca, A. Tumanov, A. Zhang, and R. Katz

Symposium on Cloud Computing, 2019

Machine learning (ML) workflows are extremely complex. The typical workflow consists of distinct stages of user interaction, such as preprocessing, training, and tuning, that are repeatedly executed by users but have heterogeneous computational requirements. This complexity makes it challenging for ML users to correctly provision and manage resources and, in practice, constitutes a significant burden that frequently causes over-provisioning and impairs user productivity. Serverless computing is a compelling model for addressing the resource management problem in general, but adopting it for existing ML frameworks poses numerous challenges because of significant restrictions on local resources. This work proposes Cirrus, an ML framework that automates the end-to-end management of datacenter resources for ML workflows by efficiently taking advantage of serverless infrastructures. Cirrus combines the simplicity of the serverless interface and the scalability of the serverless infrastructure (AWS Lambdas and S3) to minimize user effort. We show that a design specialized for both serverless computation and iterative ML training is needed for robust and efficient ML training on serverless infrastructure. Our evaluation shows that Cirrus outperforms frameworks specialized along a single dimension: Cirrus is 100x faster than a general-purpose serverless system [36] and 3.75x faster than specialized ML frameworks for traditional infrastructures [49].

2018

NeurIPS

A Case for Serverless Machine Learning

J. Carreira, P. Fonseca, A. Tumanov, A. Zhang, and R. Katz

NeurIPS Workshop, 2018

The scale and complexity of ML workflows make it hard to provision and manage resources, a burden for ML practitioners that hinders both their productivity and effectiveness. Encouragingly, however, serverless computing has recently emerged as a compelling solution to the general problem of data center resource management. This work analyzes the resource management problem in the specific context of ML workloads and explores a research direction that leverages serverless infrastructures to automate the management of resources for ML workflows. We make a case for a serverless machine learning framework that specializes for both serverless infrastructures and ML workflows, and argue that either specialization in isolation is insufficient.