Special Track on Reliable Parallel Machine Learning Systems (ReML 2024)


Description


Scaling laws for artificial intelligence (AI) models underpin the expansion and enhancement of AI capabilities, empowering systems to learn and respond with unprecedented accuracy and granularity. As AI computing clusters grow in scale, ensuring the reliability of such large-scale systems for training and inference poses great challenges.

On one hand, both training and inference with large-scale models rely on complex interconnected systems. Components, processors, and distributed nodes interact and depend on one another across distinct parallel domains. As the system grows in size, the risk of data inconsistency, deadlocks, and starvation rises. A single network interruption or slow response may push the entire system into a sub-healthy state, or even a complete outage. Reliability bottlenecks thus become a crucial aspect to address.

On the other hand, due to the black-box nature of many large-scale models, it is hard to identify and analyze fault phenomena across diverse software and hardware configurations. Soft errors and computational failures may manifest in different forms, posing new challenges for fault management as well as higher requirements for the implementation of fault-tolerant systems.

This workshop focuses on the reliability of complex machine learning systems. Relevant discussions may range from efforts that combine the theory and engineering of distributed parallel systems to enhance system resilience, to more general approaches for modeling and optimizing the large-scale AI training and inference workloads built upon them.

Topics


The list of topics includes, but is not limited to:

  • Modeling and Optimization of Parallel Training and Inference of Transformers
  • Fault Management of Parallel Machine Learning Systems
  • Fail-Slow and Fail-Stop Detection and Prediction
  • High-Performance Checkpointing Techniques
  • Elastic and Resilient Training and Inference
  • Convergence Problems (Loss Spikes) and Training Algorithms

Submission


Authors are invited to submit original unpublished research papers as well as industrial practice papers. Simultaneous submissions to other conferences are not permitted. Detailed instructions for electronic paper submission, panel proposals, and the review process can be found on the QRS submission page.

Each submission can have a maximum of ten pages. It should include a title, the name and affiliation of each author, a 300-word abstract, and up to six keywords. Shorter papers (up to six pages) are also allowed.

All papers must conform to the QRS conference proceedings format (PDF | Word DOCX | LaTeX) and the Submission Guideline set in advance by QRS 2024. At least one author of each accepted paper is required to pay the full registration fee and present the paper at the workshop. Submissions must be in PDF format and uploaded to the conference submission site. Arrangements are being made to publish extended versions of top-quality papers in selected SCI journals.

Program Co-Chair


Ke Tang

Southern University of Science and Technology, China

Program Committee


Name Affiliation Geographic Region
Xiao Chen Huawei Technologies Co., Ltd China
Xiaowen Chu The Hong Kong University of Science and Technology (Guangzhou) China
Zheng Hu Huawei Technologies Co., Ltd China
Chengqiang Huang Huawei Technologies Co., Ltd China
Jianhui Jiang Tongji University China
Jingwen Leng Shanghai Jiao Tong University China
Guiying Li Southern University of Science and Technology China
Shaohuai Shi Harbin Institute of Technology China
Ke Tang Southern University of Science and Technology China