Horovod fails with 'wrong medium type'

The error indicates an attempt to access a medium with an incorrect type.

Understanding Horovod

Horovod is an open-source distributed deep learning framework created by Uber. It is designed to make distributed deep learning fast and easy to use. Horovod achieves this by leveraging MPI (Message Passing Interface) or NCCL (NVIDIA Collective Communications Library) to perform efficient communication between nodes, which is crucial for scaling deep learning models across multiple GPUs or machines.

Identifying the Symptom

When using Horovod, you might encounter an error message that reads: 'wrong medium type'. This error typically occurs during the initialization or execution phase of a distributed training job.

What You Observe

The training job fails to start or crashes partway through, and the error log contains the message 'wrong medium type'. Training cannot proceed until the cause is fixed.
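
With several ranks writing to the same log, it can help to locate every occurrence of the message first. The snippet below is a minimal sketch: find_medium_errors is a hypothetical helper name, and the log path you pass to it is whatever file your launcher writes to.

```shell
#!/bin/sh
# find_medium_errors: hypothetical helper, not part of Horovod.
# Prints every line of the given log file that contains the message,
# ignoring case, prefixed with its line number.
find_medium_errors() {
    grep -n -i 'wrong medium type' "$1"
}

# Example call (train.log is a placeholder path):
#   find_medium_errors train.log
```

The line numbers make it easier to see what each process logged immediately before the failure.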

Explaining the Issue

The 'wrong medium type' error indicates a mismatch between the medium type the communication layer expects and the one it actually finds in the distributed setup. It usually points to the network configuration or to the communication library in use, such as MPI or NCCL.

Root Cause Analysis

This error can occur if there is a misconfiguration in the network settings or if the wrong communication library is being used. For example, if MPI is configured to use a network interface that does not support the required communication protocol, this error might arise.

Steps to Fix the Issue

To resolve the 'wrong medium type' error, follow these steps:

1. Verify Network Configuration

  • Ensure that the network interfaces are correctly configured and accessible. You can list the available interfaces with the command below (on systems without ifconfig, ip addr show lists them as well):
    ifconfig
  • Check that MPI or NCCL is bound to the correct interface. NCCL reads it from the NCCL_SOCKET_IFNAME environment variable; MPI takes it from options passed to mpirun.
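
As a rough sketch of the interface check above, the loop below prints every interface the kernel exposes together with its operational state. It assumes a Linux host with sysfs mounted at /sys; list_ifaces is a hypothetical helper name, not a Horovod or MPI command.

```shell
#!/bin/sh
# list_ifaces: hypothetical helper, assumes Linux with sysfs at /sys.
# Prints "<name> <operstate>" for each network interface.
list_ifaces() {
    for dir in /sys/class/net/*; do
        # Skip non-directories (and the unexpanded glob if nothing matches).
        [ -d "$dir" ] || continue
        name=${dir##*/}
        state=unknown
        if [ -r "$dir/operstate" ]; then
            state=$(cat "$dir/operstate")
        fi
        printf '%s %s\n' "$name" "$state"
    done
}

list_ifaces
```

Running this on every node and comparing the output is a quick way to spot a host where the intended interface is missing or down.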

2. Check Communication Library Settings

  • Ensure that the correct communication library is installed and configured. For MPI, verify the installation with:
    mpirun --version
  • For NCCL, ensure that it is properly installed and configured. You can exercise it with the nccl-tests benchmarks (for example all_reduce_perf).
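
The checks above can be sketched as a small script. report_tool is a hypothetical helper name; horovodrun --check-build is Horovod's own flag for listing the frameworks, controllers and tensor operations it was built with, and the script only calls it when horovodrun is on PATH.

```shell
#!/bin/sh
# report_tool: hypothetical helper that reports whether a command is on PATH.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: found\n' "$1"
    else
        printf '%s: missing\n' "$1"
    fi
}

report_tool mpirun
report_tool horovodrun

# Show which controllers (MPI, Gloo) and tensor operations (NCCL, MPI, ...)
# this Horovod installation was compiled with, if horovodrun is available.
if command -v horovodrun >/dev/null 2>&1; then
    horovodrun --check-build
fi
```

If the build report does not list the library you intend to use, Horovod was built without it and no runtime setting will enable it.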

3. Update Environment Variables

  • Set the appropriate environment variables so that communication uses the intended interface. For example, you can set the network interface for NCCL as follows (replace eth0 with the interface name on your hosts):
    export NCCL_SOCKET_IFNAME=eth0
  • For MPI, make sure the mpirun command names the correct network interface; with Open MPI this is the btl_tcp_if_include MCA parameter.
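
To illustrate how the two settings fit together, the sketch below prints (but does not run) an Open MPI launch line that pins both MPI's TCP transport and NCCL to one interface. It assumes Open MPI; mpirun_cmd is a hypothetical helper, and eth0, 4 and train.py are placeholder values.

```shell
#!/bin/sh
# mpirun_cmd: hypothetical helper that prints an Open MPI launch line.
# Arguments: interface name, number of processes, training script.
mpirun_cmd() {
    iface=$1
    nproc=$2
    script=$3
    # -mca btl_tcp_if_include restricts Open MPI's TCP transport to $iface;
    # -x exports NCCL_SOCKET_IFNAME to every launched process.
    printf 'mpirun -np %s -mca btl_tcp_if_include %s -x NCCL_SOCKET_IFNAME=%s python %s\n' \
        "$nproc" "$iface" "$iface" "$script"
}

mpirun_cmd eth0 4 train.py
```

Printing the command first lets you review it before launching; other MPI implementations use different options for selecting the interface.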

Conclusion

Once the network configuration and the communication library settings are correct, the 'wrong medium type' error in Horovod should no longer appear and your distributed training jobs can run normally. For more detail, see the Horovod documentation.
