Special Track on Reliable Parallel Machine Learning Systems (ReML 2024)


Description


Scaling laws for artificial intelligence (AI) models underpin the expansion and enhancement of AI capabilities, empowering systems to learn and respond with unprecedented accuracy and granularity. As AI computing clusters grow in scale, ensuring the reliability of such large-scale systems for training and inference poses great challenges.

On one hand, both training and inference with large-scale models rely on complex interconnected systems. Components, processors, and distributed nodes interact and depend on one another across distinct parallel domains. As the system grows in size, the risk of data inconsistency, deadlocks, and starvation rises. A single network interruption or slow response may push the entire system into a sub-healthy state, or even a complete outage. Reliability bottlenecks thus become a crucial aspect to address.

On the other hand, due to the black-box nature of many large-scale models, it is hard to identify and analyze fault phenomena across diverse software and hardware configurations. Soft errors and computational failures may manifest in different forms, posing new challenges for fault management as well as higher requirements for the implementation of fault-tolerant systems.

This workshop focuses on the reliability of complex machine learning systems. Relevant discussions may range from efforts that combine the theory and engineering of distributed parallel systems to enhance system resilience, to more general approaches for modeling and optimizing the large-scale AI training and inference workloads built upon them.

Topics


The list of topics includes, but is not limited to:

  • Modeling and Optimization of Parallel Training and Inference of Transformers
  • Fault Management of Parallel Machine Learning Systems
  • Fail-Slow and Fail-Stop Detection and Prediction
  • High-Performance Checkpointing Techniques
  • Elastic and Resilient Training and Inference
  • Convergence Problems (Loss Spikes) and Training Algorithms

Submission


Authors are invited to submit original unpublished research papers as well as industrial practice papers. Simultaneous submissions to other conferences are not permitted. Detailed instructions for electronic paper submission, panel proposals, and the review process can be found on the QRS submission page.

Each submission can have a maximum of ten pages. It should include a title, the name and affiliation of each author, a 300-word abstract, and up to six keywords. Shorter papers (up to six pages) are also allowed.

All papers must conform to the QRS conference proceedings format (PDF | Word DOCX | LaTeX) and the Submission Guideline set in advance by QRS 2024. At least one author of each accepted paper is required to pay the full registration fee and present the paper at the workshop. Submissions must be in PDF format and uploaded to the conference submission site. Arrangements are being made to publish extended versions of top-quality papers in selected SCI journals.

Program Co-Chair


Ke Tang

Southern University of Science and Technology, China

Program Committee


Name Affiliation Geographic Region
Xiao Chen Huawei Technologies Co., Ltd China
Xiaowen Chu The Hong Kong University of Science and Technology (Guangzhou) China
Zheng Hu Huawei Technologies Co., Ltd China
Chengqiang Huang Huawei Technologies Co., Ltd China
Jianhui Jiang Tongji University China
Jingwen Leng Shanghai Jiao Tong University China
Guiying Li Southern University of Science and Technology China
Shaohuai Shi Harbin Institute of Technology China
Ke Tang Southern University of Science and Technology China