Horovod fails with 'address already in use'
Port conflict with another process.
Understanding Horovod
Horovod is an open-source distributed deep learning framework that makes it easy to train models across multiple GPUs and nodes. Developed by Uber, it is designed to improve the speed and efficiency of training deep learning models by leveraging data parallelism. Horovod integrates seamlessly with popular deep learning frameworks like TensorFlow, Keras, and PyTorch, enabling users to scale their training workloads with minimal code changes.
Identifying the Symptom
When using Horovod, you might encounter an error message stating: 'address already in use'. This error typically occurs during the initialization phase of a distributed training session. It indicates that Horovod is unable to bind to a network port because the port is already occupied by another process.
Explaining the Issue
The 'address already in use' error is a common networking issue that arises when multiple processes attempt to use the same network port simultaneously. In the context of Horovod, this can happen if:
- Another instance of Horovod or a different application is already using the port.
- The previous Horovod process did not terminate cleanly, leaving the port in use.
Port Allocation in Horovod
Horovod uses network ports to facilitate communication between different nodes in a distributed training setup. By default, it selects ports dynamically, but conflicts can still occur if the chosen port is already occupied.
Steps to Resolve the Issue
To resolve the 'address already in use' error in Horovod, follow these steps:
Step 1: Identify the Conflicting Process
Use the following command to identify which process is using the port:
lsof -i :<PORT>
This command lists all processes using the specified port. Replace <PORT> with the actual port number reported in the error message.
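If you run this check often, it can be wrapped in a small helper. The following is a minimal sketch, not part of Horovod: the function name show_port_owner is hypothetical, and it assumes lsof is installed. It validates the port argument and then restricts the lsof output to sockets that are actually listening on that TCP port.

```shell
#!/usr/bin/env bash
# Show which processes are listening on a given TCP port.
# Usage: show_port_owner <port>
# (show_port_owner is a hypothetical helper name, not a Horovod command.)
show_port_owner() {
    local port="$1"
    # Reject anything that is empty or contains a non-digit character.
    case "$port" in
        ''|*[!0-9]*)
            echo "usage: show_port_owner <port>" >&2
            return 2
            ;;
    esac
    # Reject numbers outside the valid TCP port range.
    if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
        echo "port out of range: $port" >&2
        return 2
    fi
    # -n and -P skip host and port name lookups;
    # -sTCP:LISTEN keeps only sockets in the LISTEN state.
    # lsof exits with a non-zero status when nothing matches.
    lsof -nP -iTCP:"$port" -sTCP:LISTEN
}
```

Call it as show_port_owner <PORT>. The PID column of the output is the value needed in the next step. Processes owned by other users may only appear when the command is run with elevated privileges.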
Step 2: Terminate the Conflicting Process
Once you have identified the process, you can terminate it using the kill command:
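kill -9 <PID>
Replace <PID> with the ID of the process using the port. Because kill -9 sends SIGKILL and gives the process no chance to clean up, consider trying a plain kill <PID> (SIGTERM) first.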
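A gentler variant of this step asks the process to exit first and forces it only if it does not respond. The sketch below is an illustration rather than an official procedure: stop_process is a hypothetical helper name and the 5-second default grace period is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Stop a process politely: send SIGTERM, wait up to a grace period, and
# send SIGKILL only if the process is still alive afterwards.
# Usage: stop_process <PID> [grace_seconds]   (grace defaults to 5)
# (stop_process is a hypothetical helper name.)
stop_process() {
    local pid="$1" grace="${2:-5}" waited=0
    if ! kill -TERM "$pid" 2>/dev/null; then
        echo "could not signal PID $pid (already gone, or not permitted)" >&2
        return 1
    fi
    # kill -0 sends no signal; it only checks that the PID can still be signalled.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    # Still alive after the grace period: force termination.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null || true
    fi
    return 0
}
```

For example, stop_process <PID> 10 waits up to ten seconds before falling back to SIGKILL. Giving the process time to exit on its own lets it close its sockets and release the port normally.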
Step 3: Configure Horovod to Use a Different Port
If terminating the conflicting process is not an option, configure Horovod to use a different port. You can set the Gloo rendezvous address and port through the HOROVOD_GLOO_RENDEZVOUS_ADDR and HOROVOD_GLOO_RENDEZVOUS_PORT environment variables:
export HOROVOD_GLOO_RENDEZVOUS_ADDR=localhost
export HOROVOD_GLOO_RENDEZVOUS_PORT=<PORT>
Replace <PORT> with a port number that is not in use.
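One way to pick an unused port is to probe a range and take the first port that refuses a connection. The sketch below is a heuristic under stated assumptions: find_free_port is a hypothetical helper, the range 20000 to 20100 is only an example, and the probe relies on bash's /dev/tcp redirection, so it detects only listeners reachable on 127.0.0.1. Another process could also claim the port between the check and the start of training.

```shell
#!/usr/bin/env bash
# Print the first port in [start, end] that refuses a TCP connection on
# 127.0.0.1, i.e. that has no listener reachable over loopback.
# Uses bash's /dev/tcp redirection, so it needs bash rather than plain sh.
# (find_free_port is a hypothetical helper name.)
find_free_port() {
    local start="$1" end="$2" port
    port="$start"
    while [ "$port" -le "$end" ]; do
        # The subshell opens the connection and closes it again on exit.
        if ! (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
            echo "$port"
            return 0
        fi
        port=$((port + 1))
    done
    return 1  # every port in the range accepted a connection
}

# Example range only; choose one that suits your environment.
if port="$(find_free_port 20000 20100)"; then
    export HOROVOD_GLOO_RENDEZVOUS_ADDR=localhost
    export HOROVOD_GLOO_RENDEZVOUS_PORT="$port"
    echo "Using rendezvous port $port"
else
    echo "No free port found in the range" >&2
fi
```

Source the script (rather than executing it) if the exported variables should remain set in the shell that later launches the training job.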
Additional Resources
For more information on configuring Horovod and troubleshooting common issues, refer to the following resources:
- Horovod Official Documentation
- Horovod GitHub Repository
- Horovod Questions on Stack Overflow