Horovod is an open-source distributed deep learning framework created by Uber. It is designed to make distributed deep learning fast and easy to use. Horovod achieves this by leveraging MPI (Message Passing Interface) or NCCL (NVIDIA Collective Communications Library) to perform efficient communication between nodes, which is crucial for scaling deep learning models across multiple GPUs or machines.
When using Horovod, you might encounter an error message that reads 'wrong medium type'. This error typically occurs during the initialization or execution phase of a distributed training job.
The training job fails to start or crashes unexpectedly, and the error log contains the message 'wrong medium type'. This can be frustrating because it halts your distributed training.
The 'wrong medium type' error indicates a mismatch between the expected and actual medium types used for communication in the distributed setup. It often relates to the configuration of the network or of the communication libraries, such as MPI or NCCL.
This error can occur if there is a misconfiguration in the network settings or if the wrong communication library is being used. For example, if MPI is configured to use a network interface that does not support the required communication protocol, this error might arise.
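One way to catch this kind of misconfiguration early is to check, on each node, that the interface named in NCCL_SOCKET_IFNAME actually exists. The following is a minimal sketch, not part of Horovod: the helper names available_interfaces and check_interface are hypothetical, and it assumes the variable holds a single exact interface name (NCCL also accepts other forms, such as prefixes and lists, which this sketch does not handle).

```python
import os
import socket


def available_interfaces():
    """Return the names of the network interfaces visible on this host."""
    # socket.if_nameindex() returns (index, name) pairs.
    return [name for _, name in socket.if_nameindex()]


def check_interface(env=None, interfaces=None):
    """Report whether NCCL_SOCKET_IFNAME names an interface present on this host.

    Hypothetical helper: assumes the variable holds one exact interface name.
    Returns a (ok, message) tuple.
    """
    env = os.environ if env is None else env
    interfaces = available_interfaces() if interfaces is None else interfaces
    wanted = env.get("NCCL_SOCKET_IFNAME")
    if wanted is None:
        return True, "NCCL_SOCKET_IFNAME is not set"
    if wanted in interfaces:
        return True, f"{wanted} is present on this host"
    return False, f"{wanted} not found; available: {', '.join(interfaces)}"


if __name__ == "__main__":
    ok, message = check_interface()
    print(message)
```

Running this on every node before launching the job shows whether the configured interface name is valid everywhere, which is a common source of mismatches between nodes.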
To resolve the 'wrong medium type' error, follow these steps:
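1. Verify the network configuration. List the network interfaces on each node with ifconfig and confirm that the interface you intend to use is present. Check that environment variables such as HOROVOD_MPI_THREADS_DISABLE or NCCL_SOCKET_IFNAME are set to the values you expect.
2. Check the communication library. Run mpirun --version to confirm which MPI installation is being used.
3. Set the network interface for NCCL. For example, if eth0 is the correct interface on your nodes: export NCCL_SOCKET_IFNAME=eth0
4. Make sure the mpirun command specifies the correct network interface.

By ensuring that the network configuration and communication library settings are correct, you can resolve the 'wrong medium type' error in Horovod and let your distributed training jobs run. For more detailed information, see the Horovod documentation.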
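The last two steps can be combined in a small launcher script. This is a sketch under stated assumptions rather than a definitive recipe: it assumes Open MPI (whose mpirun accepts -x to export an environment variable and -mca btl_tcp_if_include to restrict the TCP transport to an interface), and the function name build_mpirun_command, the interface eth0, the process count, and the script name train.py are all placeholders.

```python
import shlex


def build_mpirun_command(interface, num_procs, script="train.py"):
    """Build an Open MPI mpirun argument list pinned to one network interface.

    Hypothetical helper; interface, num_procs, and script are placeholders.
    """
    return [
        "mpirun",
        "-np", str(num_procs),
        # Restrict Open MPI's TCP transport to the chosen interface.
        "-mca", "btl_tcp_if_include", interface,
        # Export the same interface choice to NCCL in every launched process.
        "-x", f"NCCL_SOCKET_IFNAME={interface}",
        "python", script,
    ]


if __name__ == "__main__":
    # Print the command for inspection instead of running it.
    print(shlex.join(build_mpirun_command("eth0", 4)))
```

Building the command as a list keeps the MPI and NCCL interface settings in one place, so the two cannot drift apart. Other MPI implementations use different flags, so adapt the arguments to the mpirun reported by mpirun --version.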