HeterSim: enable flexible simulation for heterogeneous distributed deep learning platform
16 January 2025
Yinan Tang, Hongwei Zhang, Li Wang, Zhenhua Guo, Yaqian Zhao, Rengang Li
Proceedings Volume 13447, International Conference on Mechatronics and Intelligent Control (ICMIC 2024); 134473A (2025) https://doi.org/10.1117/12.3047645
Event: International Conference on Mechatronics and Intelligent Control (ICMIC 2024), 2024, Wuhan, China
Abstract
Distributed deep learning has become an essential technique for accelerating deep learning, but its performance is often affected by the heterogeneous computing nodes and heterogeneous communication networks within the distributed deep learning platform. Because deploying and running distributed deep learning in practice is costly, it is almost impossible for researchers to optimize the training strategy of distributed deep learning tasks in real-world environments. In this paper, we propose HeterSim, a simulator specifically designed for heterogeneous distributed deep learning platforms. HeterSim enables flexible configuration of computing node performance and network connections, and supports defining and simulating distributed deep learning workloads using graph-based representations. HeterSim also allows the communication strategies used in the distributed deep learning process to be modified, thereby helping researchers validate their designs for distributed deep learning. We verify the feasibility and flexibility of HeterSim by generating a simulation platform at the scale of millions of nodes and by successfully simulating the distributed deep learning process of ResNet50. We aim to provide HeterSim as a flexible and user-friendly simulator that targets heterogeneous distributed deep learning platforms and helps researchers evaluate and optimize the strategy of distributed deep learning tasks at a lower cost.
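To make the idea of a graph-based workload on heterogeneous nodes concrete, the following is a minimal illustrative sketch in Python. It is not HeterSim's API or algorithm: the names `Node`, `Task`, `simulate`, the `bandwidth` table, and all numeric values are hypothetical, and the model is deliberately simplified (tasks are scheduled in topological order, each node runs one task at a time, and links are assumed to be contention-free).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Node:
    name: str
    flops_per_s: float  # hypothetical compute throughput of this node


@dataclass
class Task:
    name: str
    node: str      # name of the node the task is placed on
    flops: float   # amount of compute the task needs
    # predecessor task name -> bytes that must be received from it
    deps: dict = field(default_factory=dict)


def simulate(tasks, nodes, bandwidth):
    """Return the finish time of every task.

    bandwidth maps (src_node, dst_node) to bytes per second.
    Assumption: no link contention, one task at a time per node.
    """
    by_name = {t.name: t for t in tasks}
    # TopologicalSorter expects a mapping: task -> set of predecessors.
    order = TopologicalSorter({t.name: set(t.deps) for t in tasks}).static_order()
    node_free = {n: 0.0 for n in nodes}  # time at which each node becomes idle
    finish = {}
    for name in order:
        t = by_name[name]
        ready = 0.0
        for dep, nbytes in t.deps.items():
            src = by_name[dep].node
            # Data already on the same node costs no transfer time.
            comm = 0.0 if src == t.node else nbytes / bandwidth[(src, t.node)]
            ready = max(ready, finish[dep] + comm)
        start = max(ready, node_free[t.node])
        finish[name] = start + t.flops / nodes[t.node].flops_per_s
        node_free[t.node] = finish[name]
    return finish


# Hypothetical two-node platform: one fast node and one slow node.
nodes = {"fast": Node("fast", 100.0), "slow": Node("slow", 10.0)}
bandwidth = {("slow", "fast"): 50.0, ("fast", "slow"): 50.0}
tasks = [
    Task("fwd_a", "fast", flops=200.0),
    Task("fwd_b", "slow", flops=20.0),
    # "sync" waits for both partial results before it can run.
    Task("sync", "fast", flops=100.0, deps={"fwd_a": 0.0, "fwd_b": 100.0}),
]
finish = simulate(tasks, nodes, bandwidth)
```

In this toy setup, changing a node's `flops_per_s` or an entry in `bandwidth` changes when `sync` can start, which is the kind of what-if question a platform simulator is meant to answer without running on real hardware.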
(2025) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Yinan Tang, Hongwei Zhang, Li Wang, Zhenhua Guo, Yaqian Zhao, and Rengang Li "HeterSim: enable flexible simulation for heterogeneous distributed deep learning platform", Proc. SPIE 13447, International Conference on Mechatronics and Intelligent Control (ICMIC 2024), 134473A (16 January 2025); https://doi.org/10.1117/12.3047645
KEYWORDS
Computer simulations
Switches
Deep learning
Education and training
Data modeling
Modeling
Computation time