Exploring Optimization Methods for Distributed Machine Learning Models
College:
The Dorothy and George Hennings College of Science, Mathematics, and Technology
Major:
Computer Science
Faculty Research Advisor(s):
Yulia Kumar
Abstract:
With the proliferation of machine learning (ML) applications, the scalability of training processes has become paramount. However, the escalating size of datasets, coupled with limits on per-node compute and storage and exacerbated by the slowdown of Dennard scaling, has rendered traditional training approaches inadequate. Both computational and communication bottlenecks impede the efficiency of distributed training frameworks, necessitating novel strategies to enhance scalability and convergence. This study addresses these challenges by exploring, implementing, and comparing approaches to optimizing the distributed training of neural networks (NNs), with the goal of improving training efficiency and scalability. It aims to make the following contributions: 1) Optimization of computational and communication resources: exploring techniques that improve the utilization of computational and communication resources in distributed training systems and studying the resulting trade-off between performance and accuracy. 2) Evaluation of performance and scalability: assessing the performance and scalability of the proposed methodologies. By addressing these objectives, this research contributes to ongoing efforts to optimize distributed training for large-scale NNs.
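As one illustration of the kind of communication-level optimization explored in this work, the sketch below shows data-parallel training in PyTorch with a gradient-compression communication hook that casts gradients to FP16 before the all-reduce, roughly halving the bytes exchanged per step at a possible cost in accuracy. This is a minimal, hypothetical example, not the study's implementation: the toy model, synthetic data, Gloo backend, and choice of fp16_compress_hook are all illustrative assumptions.

```python
# Minimal sketch: data-parallel training with gradient compression.
# Assumptions (not from the abstract): toy MLP, synthetic data, Gloo backend on CPU,
# and PyTorch's built-in FP16 gradient-compression DDP communication hook.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

def train(rank: int, world_size: int):
    # Each rank runs one replica of the model and joins the process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
    ddp_model = DDP(model)

    # Communication optimization: compress gradients to FP16 before the
    # all-reduce, trading a little numerical precision for less traffic.
    ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):  # synthetic batches stand in for a real dataset
        inputs = torch.randn(32, 128)
        targets = torch.randint(0, 10, (32,))
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()  # DDP overlaps the gradient all-reduce with backprop
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
```

Gradient compression is only one example of the performance/accuracy trade-off named above; comparable sketches could swap in other hooks (e.g., low-rank PowerSGD) or overlap-oriented scheduling without changing the surrounding training loop.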