Horovod cannot find NCCL

NCCL is not installed, or its libraries are not on the library search path.

Understanding Horovod and Its Purpose

Horovod is an open-source distributed deep learning framework that makes it easy to train models across multiple GPUs and nodes. It is designed to improve the speed and efficiency of training large-scale machine learning models by leveraging data parallelism. Horovod is particularly popular in environments where scaling out training jobs is crucial, such as in research and production settings.

Identifying the Symptom: Horovod Cannot Find NCCL

When using Horovod, you may encounter an error message indicating that it cannot find NCCL. This symptom typically appears during the initialization phase of a distributed training job, when Horovod attempts to use the NVIDIA Collective Communications Library (NCCL) for efficient multi-GPU communication.

Exploring the Issue: Why Horovod Needs NCCL

NCCL is a library developed by NVIDIA that provides high-performance primitives for collective communication, such as all-reduce and broadcast, which are essential for distributed training. If Horovod cannot locate NCCL, the usual causes are that NCCL is not installed, that its library directory is not on the library search path (LD_LIBRARY_PATH on Linux), or that Horovod was built before NCCL was available. In each case Horovod cannot use NCCL for GPU collectives, so distributed training either fails to start or runs without NCCL's optimized communication.

Common Error Messages

  • horovodrun: error: NCCL library not found
  • ImportError: Horovod requires NCCL to be installed
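Before changing anything, it can help to confirm which collective backends the installed Horovod was compiled with. The following is a minimal sketch, assuming horovodrun is on your PATH; the helper name check_horovod_nccl is hypothetical, and the match relies on the "[X] NCCL" line that horovodrun --check-build prints for an enabled backend.

```shell
# Hypothetical helper: report whether the installed Horovod build lists NCCL.
check_horovod_nccl() {
    if ! command -v horovodrun >/dev/null 2>&1; then
        echo "horovodrun not found on PATH"
        return 2
    fi
    # --check-build prints the frameworks, controllers, and tensor operations
    # compiled into Horovod; enabled entries are marked with [X].
    if horovodrun --check-build 2>/dev/null | grep -q '\[X\] NCCL'; then
        echo "Horovod was built with NCCL"
    else
        echo "Horovod was built without NCCL"
        return 1
    fi
}

check_horovod_nccl
```

If the helper reports that Horovod was built without NCCL, installing NCCL alone will not be enough: Horovod also has to be rebuilt, as described in Step 4.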

Steps to Resolve the Issue

To resolve the issue of Horovod not finding NCCL, follow these steps:

Step 1: Install NCCL

If NCCL is not installed, you need to download and install it. You can find the installation instructions on the NVIDIA NCCL download page. Follow the instructions specific to your operating system and CUDA version.

Step 2: Verify NCCL Installation

After installation, verify that NCCL is correctly installed. On Debian or Ubuntu systems where NCCL was installed from packages, run the following command:

dpkg -l | grep nccl

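This command should list the installed NCCL packages. On RPM-based distributions, rpm -qa | grep nccl serves the same purpose. If NCCL was installed from a tarball, neither command will show it; check for libnccl.so in the directory where you extracted the archive instead.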
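A package listing only shows that a package is installed, not that the dynamic linker can see the library. As a distribution-independent check, the sketch below searches the linker cache for libnccl. The helper name find_nccl_in_ldcache is hypothetical, and the check assumes a Linux system with ldconfig available (on some systems it lives in /sbin and may not be on a regular user's PATH).

```shell
# Hypothetical helper: look for libnccl in the dynamic linker cache.
find_nccl_in_ldcache() {
    # ldconfig -p prints every shared library recorded in the linker cache.
    ldconfig -p 2>/dev/null | grep 'libnccl\.so'
}

if find_nccl_in_ldcache; then
    echo "NCCL is visible to the dynamic linker"
else
    echo "NCCL is not in the linker cache"
fi
```

A library that is reachable only through LD_LIBRARY_PATH does not appear in the linker cache, so a negative result here means the search path set in Step 3 matters, not necessarily that NCCL is missing.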

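Step 3: Update the Library Search Path

Ensure that the NCCL library directory is on the dynamic linker's search path. Shared libraries are located through LD_LIBRARY_PATH, not through PATH, which only affects executables. Add the following line to your ~/.bashrc or ~/.bash_profile file: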

export LD_LIBRARY_PATH=/usr/local/nccl/lib:$LD_LIBRARY_PATH

Replace /usr/local/nccl/lib with the directory that contains libnccl.so on your system, then run source ~/.bashrc or open a new shell so that the change takes effect.
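An export line in ~/.bashrc is evaluated again in every nested shell, so the same directory can end up in LD_LIBRARY_PATH several times. The sketch below is one way to add a directory only when it is missing, and to avoid leaving an empty entry when the variable starts out unset (the dynamic linker treats an empty entry as the current directory). The helper name prepend_ld_library_path is hypothetical.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH only if absent.
prepend_ld_library_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*)
            # Already on the search path; leave the variable unchanged.
            ;;
        *)
            # Append the old value only when it is non-empty, so no stray
            # leading or trailing colon is introduced.
            LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# /usr/local/nccl/lib is the example path used in this guide.
prepend_ld_library_path /usr/local/nccl/lib
```

Calling the helper repeatedly with the same directory leaves LD_LIBRARY_PATH unchanged after the first call.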

Step 4: Rebuild Horovod

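If you installed NCCL after building Horovod, rebuild Horovod so that it compiles and links against the NCCL library. pip does not rebuild a package that is already installed, so uninstall Horovod first, then reinstall it with NCCL requested explicitly through HOROVOD_GPU_OPERATIONS:

pip uninstall -y horovod
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl pip install --no-cache-dir horovod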

Ensure that /usr/local/nccl is replaced with the correct NCCL installation path, that is, the directory that contains NCCL's include and lib subdirectories.
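HOROVOD_NCCL_HOME assumes that NCCL's headers and libraries sit under one directory. Package installs often split them, and for that case Horovod's build also reads HOROVOD_NCCL_INCLUDE and HOROVOD_NCCL_LIB. The sketch below wraps the rebuild for such a layout. The helper name rebuild_horovod_split_layout is hypothetical, and the paths in the example call are assumptions for a Debian or Ubuntu package install that you should confirm on your own system.

```shell
# Hypothetical helper: rebuild Horovod against NCCL headers and libraries
# that live in separate directories.
rebuild_horovod_split_layout() {
    include_dir=$1   # directory containing nccl.h
    lib_dir=$2       # directory containing libnccl.so
    # Remove the existing build so pip compiles Horovod again.
    pip uninstall -y horovod
    HOROVOD_GPU_OPERATIONS=NCCL \
    HOROVOD_NCCL_INCLUDE="$include_dir" \
    HOROVOD_NCCL_LIB="$lib_dir" \
        pip install --no-cache-dir horovod
}

# Example call (assumed paths, left commented out so nothing is reinstalled
# by accident):
# rebuild_horovod_split_layout /usr/include /usr/lib/x86_64-linux-gnu
```

After the rebuild finishes, horovodrun --check-build should list NCCL among the available tensor operations.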

Conclusion

By following these steps, you should be able to resolve the issue of Horovod not finding NCCL. Proper installation and configuration of NCCL are crucial for leveraging the full potential of Horovod in distributed training scenarios. For more detailed information, refer to the Horovod documentation.
