Deep Learning (DL) has attracted a lot of attention in recent years, and Python has been the front-runner language for DL frameworks and implementations. Training DL models remains a challenge, as it requires a large amount of time and computational resources. We will discuss distributed training of deep neural networks using the Message Passing Interface (MPI) across multiple GPUs or CPUs.
Deep learning models are a subset of machine learning models and algorithms, designed to endow computers with artificial intelligence. The rise of deep learning can be attributed to the availability of large datasets and growing computational power. Deep learning models are used in face recognition, speech recognition, and many other applications. TensorFlow is a popular deep learning framework for Python, used to implement and train Deep Neural Networks (DNNs).

The Message Passing Interface (MPI) is a message-passing standard, widely used in parallel applications, that allows processes to communicate with each other. Horovod provides a Python interface that couples DNNs written in TensorFlow with MPI, reducing training time through a distributed training approach. MPI implementations provide optimized communication routines, including point-to-point and collective communication. Point-to-point communication is a pattern involving a single sender process and a single receiver process, while collective communication involves a group of processes exchanging messages. In particular, reduction is a collective operation widely used in deep learning to combine values, such as gradients, across processes. In this talk, we will demonstrate the challenges and the elements to consider when training DNNs using MPI in Python.
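To make these communication patterns concrete, here is a minimal mpi4py sketch (not taken from the talk itself) contrasting point-to-point messaging with an allreduce, the reduction collective behind gradient averaging in data-parallel training; the buffer contents and tag value are purely illustrative.

```python
# Run with at least two processes, e.g.: mpirun -np 2 python mpi_patterns.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Point-to-point: rank 0 sends a buffer directly to rank 1.
if size > 1:
    if rank == 0:
        payload = np.arange(4, dtype='d')
        comm.Send(payload, dest=1, tag=11)
    elif rank == 1:
        payload = np.empty(4, dtype='d')
        comm.Recv(payload, source=0, tag=11)

# Collective: every rank contributes a local "gradient" array;
# Allreduce sums them element-wise and returns the result to all ranks.
local_grad = np.full(4, float(rank), dtype='d')
summed = np.empty(4, dtype='d')
comm.Allreduce(local_grad, summed, op=MPI.SUM)
avg_grad = summed / size  # averaged gradients, as in data-parallel training
print(f"rank {rank}: averaged gradient = {avg_grad}")
```

Horovod hides this reduction pattern behind a drop-in optimizer wrapper. Below is a minimal sketch of the standard Horovod/Keras recipe, assuming a toy model, random data, and the `horovodrun` launcher; it is not the authors' actual training code.

```python
# Launch with e.g.: horovodrun -np 4 python train.py
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # initializes MPI under the hood

# Pin each process to a single GPU (one MPI rank per GPU).
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Toy model and random data, purely for illustration.
x_train = np.random.rand(1024, 784).astype('float32')
y_train = np.random.randint(0, 10, size=(1024,))
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers, then wrap the
# optimizer so gradients are averaged with an allreduce at every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

# Broadcast initial weights from rank 0 so all workers start in sync,
# and keep log output on a single rank.
model.fit(x_train, y_train, batch_size=32, epochs=2,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```

Scaling the learning rate by `hvd.size()` is the usual convention here, because the effective batch size grows with the number of workers.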