Horovod fails with 'address already in use'

Cause: the port Horovod tries to bind is already held by another process.

Understanding Horovod

Horovod is an open-source distributed deep learning framework, originally developed at Uber, for training models across multiple GPUs and nodes. It speeds up training through data parallelism and integrates with TensorFlow, Keras, and PyTorch, so existing training scripts can be scaled out with few code changes.

Identifying the Symptom

When using Horovod, you might encounter an error message stating: 'address already in use'. This error typically occurs during the initialization phase of a distributed training session. It indicates that Horovod is unable to bind to a network port because the port is already occupied by another process.

Explaining the Issue

The 'address already in use' error (EADDRINUSE on POSIX systems) is raised when a process tries to bind a socket to an address and port that another socket already holds. In the context of Horovod, this can happen if:

  • Another instance of Horovod or a different application is already using the port.
  • The previous Horovod process did not terminate cleanly, leaving the port in use.
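The failure itself is easy to reproduce outside Horovod. The following is a minimal sketch using only Python's standard `socket` module; the function name `reproduce_address_in_use` is hypothetical and nothing here is Horovod code. It binds one socket to an OS-chosen port, then tries to bind a second socket to the same port and returns the resulting error number.

```python
import errno
import socket


def reproduce_address_in_use() -> int:
    """Bind two sockets to the same port; return the errno of the second bind (0 if it succeeded)."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # same port as `first`, expected to fail
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = reproduce_address_in_use()
    print("second bind failed with EADDRINUSE:", code == errno.EADDRINUSE)
```

The same mechanism applies when the first socket belongs to a stale training process rather than to the same script.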

Port Allocation in Horovod

Horovod uses network ports to facilitate communication between different nodes in a distributed training setup. By default, it selects ports dynamically, but conflicts can still occur if the chosen port is already occupied.
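One common way to select a port dynamically is to bind to port 0 and let the operating system pick a free one. The sketch below illustrates that general technique with the standard library; it is not Horovod's actual port-selection code, and the helper name `find_free_port` is an assumption made for illustration.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently free TCP port on `host` and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0: the OS assigns an unused port
        return sock.getsockname()[1]


if __name__ == "__main__":
    print("free port:", find_free_port())
```

Note that the port is released when the helper returns, so another process can claim it before it is reused. That window is one reason a dynamically chosen port can still collide.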

Steps to Resolve the Issue

To resolve the 'address already in use' error in Horovod, follow these steps:

Step 1: Identify the Conflicting Process

Use the following command to identify which process is using the port:

lsof -i :<port>

This command lists the processes that have the specified port open, including their process IDs (PIDs). Replace <port> with the actual port number reported in the error message.
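If `lsof` is not available, a quick check from Python can at least confirm whether a port is currently taken, although it cannot tell you which process holds it. This is a minimal sketch with the standard library; `port_is_in_use` is a hypothetical helper, and it only tests the given host address.

```python
import errno
import socket


def port_is_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding (host, port) fails with EADDRINUSE, False if the bind succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # other errors (e.g. permission denied) are not port conflicts
        return False


if __name__ == "__main__":
    # 29500 is an arbitrary example port, not a Horovod default.
    print(port_is_in_use(29500))
```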

Step 2: Terminate the Conflicting Process

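Once you have identified the process, and confirmed that it is safe to stop, terminate it with the kill command:

kill -9 <PID>

Replace <PID> with the ID of the process using the port. Because -9 (SIGKILL) gives the process no chance to clean up, consider trying a plain kill <PID> (SIGTERM) first.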

Step 3: Configure Horovod to Use a Different Port

If terminating the conflicting process is not an option, configure Horovod to use a different port. With the Gloo controller, the rendezvous address and port are set through the HOROVOD_GLOO_RENDEZVOUS_ADDR and HOROVOD_GLOO_RENDEZVOUS_PORT environment variables (a single port, not a range):

export HOROVOD_GLOO_RENDEZVOUS_ADDR=localhost
export HOROVOD_GLOO_RENDEZVOUS_PORT=<port>

Replace <port> with a port number that is not in use.
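The same settings can be applied from a Python launcher script before any worker starts. The sketch below assumes, as the step above does, that your launch setup reads these two environment variables; whether it does depends on how the job is started. The helper name `configure_rendezvous` is hypothetical.

```python
import os
import socket


def configure_rendezvous(host: str = "localhost") -> int:
    """Pick a free port on `host` and export it through the Gloo rendezvous variables."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0: the OS assigns an unused port
        port = sock.getsockname()[1]
    # Assumption: the launcher and workers read these variables at startup.
    os.environ["HOROVOD_GLOO_RENDEZVOUS_ADDR"] = host
    os.environ["HOROVOD_GLOO_RENDEZVOUS_PORT"] = str(port)
    return port


if __name__ == "__main__":
    print("rendezvous port:", configure_rendezvous())
```

Every worker must see the same address and port, so set them once in the launching process rather than separately in each worker.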
