Horovod fails with 'invalid device function'

Mismatch between the compiled CUDA code and the GPU architecture.

Understanding Horovod and Its Purpose

Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. It is designed to make distributed deep learning fast and easy to use, so that developers can scale training workloads across multiple GPUs and nodes. The source code is available in the Horovod GitHub repository.

Identifying the Symptom: 'Invalid Device Function'

When using Horovod, you might encounter the error message: invalid device function. This is the CUDA runtime error cudaErrorInvalidDeviceFunction. It typically appears when a distributed training job launches its first GPU kernel, and it causes the process to fail.

Exploring the Issue: Mismatch in CUDA Code and GPU Architecture

The 'invalid device function' error often indicates a mismatch between the compiled CUDA code and the GPU architecture. It occurs when the loaded binary contains no kernel image that the GPU can run. For instance, code compiled only for a newer architecture cannot run on an older GPU, and code compiled only for an older architecture, without embedded PTX, cannot run on a GPU of a newer major architecture.
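One way to confirm such a mismatch is to list the architectures embedded in the compiled library. The sketch below assumes the CUDA toolkit's cuobjdump tool is installed and that its --list-elf output names each embedded image with an sm_XX suffix. The helper name embedded_archs and the library path are hypothetical.

```shell
# Reads `cuobjdump --list-elf` output on stdin and prints the unique
# sm_XX architecture names found in it, one per line.
embedded_archs() {
    grep -oE 'sm_[0-9]+[a-z]?' | sort -u
}

# Usage (the path is a placeholder for the library you want to inspect):
#   cuobjdump --list-elf /path/to/library.so | embedded_archs
```

If the GPU's architecture is missing from the list, the library was not built for it. Note that cuobjdump --list-ptx shows any embedded PTX, which newer GPUs can still compile at load time.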

Understanding CUDA Architectures

A GPU's compute capability is a version number, such as 7.5, that identifies its CUDA architecture and the features it supports. Each GPU model has a single compute capability, and CUDA code must be compiled for a target that is compatible with it. You can find a list of compute capabilities for different GPUs on the NVIDIA CUDA GPUs page.

Steps to Fix the Issue

Step 1: Identify Your GPU Architecture

First, determine the compute capability of your GPU. Start by identifying the GPU model with the following command:

nvidia-smi

This command will provide details about your GPU, including its model. Cross-reference this with the NVIDIA CUDA GPUs page to find the compute capability.
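Once you know the compute capability, it has to be rewritten as an architecture name for the compiler. The sketch below shows that conversion. The helper name cc_to_sm is hypothetical, and the compute_cap query field mentioned in the comment is only available in recent NVIDIA drivers, so treat it as an assumption about your environment.

```shell
# Convert a compute capability such as "7.5" into an nvcc
# architecture name such as "sm_75" by dropping the dot.
cc_to_sm() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# On drivers that support the compute_cap field, the capability can be
# read directly instead of looking it up by model name:
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
#
# Example with a fixed value:
#   cc_to_sm 7.5    # prints sm_75
```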

Step 2: Compile CUDA Code for the Correct Architecture

Ensure that your CUDA code is compiled for the correct architecture. You can specify the target architecture using the -arch flag when compiling your CUDA code. For example, if your GPU has a compute capability of 7.5, use:

nvcc -arch=sm_75 -o my_program my_program.cu

This command compiles the CUDA code for the specified architecture.
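If the same binary has to run on several GPU models, nvcc's -gencode option can embed code for more than one architecture. The sketch below builds those flags from a list of compute capabilities written without the dot. The helper name gencode_flags is hypothetical, and adding a PTX target for the highest architecture is one common choice rather than a requirement.

```shell
# Build nvcc -gencode flags for a list of compute capabilities
# (digits only, in ascending order, for example: 70 75).
# The last entry also gets a PTX target so that newer GPUs can
# JIT-compile the kernels at load time.
gencode_flags() {
    flags=""
    last=""
    for cc in "$@"; do
        flags="$flags -gencode arch=compute_${cc},code=sm_${cc}"
        last="$cc"
    done
    if [ -n "$last" ]; then
        flags="$flags -gencode arch=compute_${last},code=compute_${last}"
    fi
    # Strip the leading space left by the first append.
    printf '%s\n' "${flags# }"
}

# Usage (my_program.cu is the example file from this step):
#   nvcc $(gencode_flags 70 75) -o my_program my_program.cu
```

The trade-off is a larger binary and a longer compile in exchange for not having to rebuild per GPU model.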

Step 3: Rebuild Horovod with Correct CUDA Support

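If Horovod itself needs to be rebuilt, make sure its CUDA kernels are compiled for your GPU. Horovod's installation guide documents the HOROVOD_BUILD_CUDA_CC_LIST environment variable, a comma-separated list of compute capabilities written without the dot (for example, 75 for compute capability 7.5). Set it before installation, and disable pip's cache so that the package is actually rebuilt rather than reinstalled from a cached wheel:

HOROVOD_BUILD_CUDA_CC_LIST=75 pip install --no-cache-dir horovod

This command builds Horovod's CUDA kernels for the specified GPU architecture.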
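After reinstalling, it can help to confirm that the new build includes GPU support at all. The sketch below assumes that horovodrun --check-build prints available features as lines such as "[X] NCCL", which is the format shown in Horovod's documentation. The helper name build_has_feature is hypothetical.

```shell
# Succeeds if `horovodrun --check-build` output (read from stdin)
# marks the named feature as available, i.e. contains "[X] <name>".
build_has_feature() {
    grep -qF -- "[X] $1"
}

# Usage on a machine where Horovod is installed:
#   horovodrun --check-build | build_has_feature NCCL && echo "NCCL enabled"
```

This check only shows which operations were compiled in. It does not show which GPU architectures the kernels target, so a passing check does not rule out an architecture mismatch by itself.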

Conclusion

By ensuring that your CUDA code and Horovod are compiled for the correct GPU architecture, you can resolve the 'invalid device function' error. For further assistance, consider visiting the Horovod Documentation for more detailed guidance.
