Talk Abstract:
Distributed machine learning (ML) has played a key role in today's proliferation of AI services. A typical distributed ML setup partitions the training dataset across multiple worker nodes, which update model parameters in parallel under a parameter server architecture. ML training jobs are typically resource-elastic: they can be completed in different amounts of time under different resource configurations. A fundamental problem in a distributed ML cluster is how to exploit the demand elasticity of ML jobs and schedule them with appropriate resource configurations, such that resource utilization is maximized and the average job completion time is minimized. To address this, we propose an online scheduling algorithm that decides, upon each job's arrival, its execution time window and the number and type of concurrent workers and parameter servers, with the goal of minimizing the weighted average job completion time. Our online algorithm consists of (i) an online scheduling framework that iteratively groups unprocessed ML training jobs into batches, and (ii) a batch scheduling algorithm that configures each ML job so as to maximize the total weight of jobs scheduled in the current iteration. The algorithm guarantees a parameterized competitive ratio while running in polynomial time. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers used in today's AI cloud systems.
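To make the two-level structure described above more concrete, the following is a minimal illustrative sketch (not the speaker's actual algorithm). It assumes a single aggregate worker-capacity constraint and a simple weight-first greedy rule inside each batch; the names `Job`, `batch_schedule`, and `online_schedule`, as well as the specific configuration tuples, are hypothetical simplifications introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Job:
    """A resource-elastic ML training job (illustrative fields only)."""
    job_id: int
    weight: float           # priority weight in the completion-time objective
    arrival_time: int
    # candidate configurations: (num_workers, num_param_servers, run_time)
    configs: List[Tuple[int, int, int]] = field(default_factory=list)
    scheduled: bool = False

def batch_schedule(jobs, capacity, now):
    """One batch iteration: pick a configuration for each unscheduled job,
    trying to admit as much total weight as possible under a simple
    aggregate worker capacity (a stand-in for the paper's subroutine)."""
    admitted, used = [], 0
    for job in sorted(jobs, key=lambda j: j.weight, reverse=True):
        if job.scheduled or not job.configs:
            continue
        # choose the cheapest configuration that still fits the remaining capacity
        feasible = [c for c in job.configs if used + c[0] <= capacity]
        if feasible:
            workers, ps, run_time = min(feasible, key=lambda c: c[0])
            used += workers
            job.scheduled = True
            admitted.append((job.job_id, workers, ps, now, now + run_time))
    return admitted

def online_schedule(jobs, capacity, horizon):
    """Outer online framework: at each step, group jobs that have arrived but
    are not yet processed into a batch and schedule that batch."""
    schedule = []
    for t in range(horizon):
        pending = [j for j in jobs if j.arrival_time <= t and not j.scheduled]
        if pending:
            schedule.extend(batch_schedule(pending, capacity, now=t))
    return schedule

if __name__ == "__main__":
    demo_jobs = [
        Job(1, weight=3.0, arrival_time=0, configs=[(4, 1, 5), (8, 2, 3)]),
        Job(2, weight=1.0, arrival_time=1, configs=[(2, 1, 6)]),
    ]
    for entry in online_schedule(demo_jobs, capacity=8, horizon=10):
        print(entry)  # (job_id, workers, param_servers, start, end)
```

The actual algorithm presented in the talk uses a more careful per-job configuration choice and yields the stated competitive-ratio guarantee; the sketch only mirrors the batch-then-schedule control flow.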
Speaker Bio:
周睿婷 is an Associate Researcher and master's supervisor at the School of Cyber Science and Engineering, Wuhan University. She received her Ph.D. in Computer Science from the University of Calgary, Canada, in 2018, and was a postdoctoral fellow in the Department of Computer Science and Engineering at The Chinese University of Hong Kong. Her research interests include optimization algorithms for cloud computing, 5G, the Internet of Things, machine learning, and federated learning. She has published or had accepted 30 papers in top conferences and journals, including IEEE INFOCOM, ACM MobiHoc, ICPP, IEEE/ACM ToN, IEEE TPDS, IEEE JSAC, and IEEE TMC. She served as program committee chair of the IEEE INFOCOM Workshop ICCN 2019/2020/2021, and as a reviewer for top conferences and journals such as IEEE MSN, GLOBECOM, ICC, IEEE JSAC, IEEE TMC, and IEEE/ACM ToN. She received a Best Student Paper nomination at ACM GreenMetrics 2016, a Best Paper nomination at the ACM Turing Celebration Conference - China 2019, and the Best Paper Award at BIGCOM 2020. She has been selected for provincial- and municipal-level talent programs, including the Wuhan Yellow Crane Talents program for outstanding young talents and the Jiangsu Province Subei Development Distinguished Expert program.