Horovod fails with 'invalid device ordinal'

Cause: a training process is using an invalid or non-existent GPU device ID.

Understanding Horovod and Its Purpose

Horovod is an open-source distributed deep learning framework that makes it easy to train models across multiple GPUs and nodes. Developed by Uber, it is designed to improve the speed and efficiency of training large-scale machine learning models. Horovod leverages technologies like MPI (Message Passing Interface) to facilitate communication between different nodes and GPUs, enabling seamless scaling of training processes.

Identifying the Symptom: 'Invalid Device Ordinal'

When using Horovod, you might encounter the error message 'invalid device ordinal'. It typically appears during the initialization phase of a distributed training job, when each process selects its GPU, and it indicates a problem with the device IDs the job is using.

Exploring the Issue: What Does 'Invalid Device Ordinal' Mean?

'Invalid device ordinal' is a CUDA runtime error that surfaces through Horovod. It means a process asked for a GPU whose ID (ordinal) does not exist or is not visible to that process. Device ordinals are zero-based, so a process that can see four GPUs may only use IDs 0 through 3. The error appears when the device IDs used by your training script or launch configuration do not match the GPUs actually available on the machine.

Common Causes of the Error

  • Incorrect device ID specified in the training script or launch configuration.
  • GPU devices are not properly initialized or recognized by the system.
  • Mismatch between the number of GPUs specified and the actual number of GPUs available.
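A frequent way to hit the mismatch is launching more processes per node than there are visible GPUs, since Horovod training scripts commonly pin each process to the GPU whose ID equals its local rank. The sketch below shows a pre-flight check that could run before a process selects its device. It is not part of Horovod: `check_device_ordinal` is a hypothetical helper, and it assumes the two arguments come from calls such as `hvd.local_rank()` and `torch.cuda.device_count()` in a PyTorch script.

```python
def check_device_ordinal(local_rank: int, visible_gpu_count: int) -> int:
    """Return local_rank if it is a valid CUDA device ordinal, else raise.

    Hypothetical helper for illustration only. In a PyTorch + Horovod script
    the arguments would typically be hvd.local_rank() and
    torch.cuda.device_count() (an assumption about your setup).
    """
    if visible_gpu_count <= 0:
        raise RuntimeError("No GPUs are visible to this process.")
    # Valid ordinals are zero-based: 0 .. visible_gpu_count - 1.
    if not 0 <= local_rank < visible_gpu_count:
        raise RuntimeError(
            f"Local rank {local_rank} has no matching GPU: only ordinals "
            f"0..{visible_gpu_count - 1} are valid on this node."
        )
    return local_rank
```

Failing early with a message like this tends to be easier to diagnose than the raw CUDA error, because it names both the rank and the number of GPUs the process can see.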

Steps to Fix the 'Invalid Device Ordinal' Issue

To resolve this issue, follow these steps to ensure that your device IDs are correctly configured:

Step 1: Verify Available GPU Devices

First, check the available GPU devices on your system. You can use the nvidia-smi command to list all GPUs and their IDs:

nvidia-smi

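This command displays every GPU the driver detects, along with its device ID. IDs start at 0, so a machine with four GPUs lists IDs 0 through 3. Ensure that the IDs you plan to use for your Horovod job appear in this list. Note that CUDA's default device ordering can differ from the order nvidia-smi shows; setting the CUDA_DEVICE_ORDER environment variable to PCI_BUS_ID makes the two agree.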
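If you want to run the same check from a script, for example on every node of a cluster, the sketch below queries nvidia-smi in its CSV mode and returns the detected GPUs. The function names `parse_gpu_query` and `list_gpus` are hypothetical, and the code assumes nvidia-smi is on the PATH of the machine being checked.

```python
import subprocess
from typing import Dict


def parse_gpu_query(csv_text: str) -> Dict[int, str]:
    """Parse 'index, name' rows (one GPU per line) into {index: name}."""
    gpus: Dict[int, str] = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        index, name = line.split(",", 1)
        gpus[int(index)] = name.strip()
    return gpus


def list_gpus() -> Dict[int, str]:
    """Ask nvidia-smi for the index and name of every GPU on this machine."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if nvidia-smi itself fails
    )
    return parse_gpu_query(result.stdout)
```

Calling `list_gpus()` on a GPU node returns a dictionary keyed by device ID; any ID your job requests that is not a key in that dictionary is a candidate cause of the error.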

Step 2: Update Horovod Configuration

Once you have verified the available GPUs, update your job so that it only uses device IDs that exist. This is typically done by setting the CUDA_VISIBLE_DEVICES environment variable, which controls which GPUs CUDA exposes to a process, before running your training script:

export CUDA_VISIBLE_DEVICES=0,1,2,3

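Replace 0,1,2,3 with the IDs of the GPUs you wish to use, as reported in Step 1. Inside the process, the listed GPUs are renumbered starting from 0 in the order given, so with CUDA_VISIBLE_DEVICES=2,3 the training script must refer to the two GPUs as devices 0 and 1.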
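The relationship between the IDs in CUDA_VISIBLE_DEVICES and the ordinals a process may use can be sketched as a simple mapping. `visible_device_map` is a hypothetical helper written for illustration; it assumes the variable holds a comma-separated list of integer IDs and does not handle GPU UUIDs, which CUDA also accepts.

```python
from typing import Dict


def visible_device_map(cuda_visible_devices: str) -> Dict[int, int]:
    """Map in-process CUDA ordinals to the physical GPU IDs in the variable.

    Simplified sketch: assumes integer IDs only (no GPU UUIDs).
    """
    entries = [e.strip() for e in cuda_visible_devices.split(",") if e.strip()]
    # CUDA renumbers the visible GPUs from 0 in the order they are listed.
    return {ordinal: int(physical) for ordinal, physical in enumerate(entries)}


mapping = visible_device_map("2,3")
print(mapping)  # {0: 2, 1: 3}
```

In this example only ordinals 0 and 1 are valid inside the process. A script that still asks for device 2 or 3 would request an ordinal that is out of range, which is the situation the 'invalid device ordinal' error reports.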

Step 3: Restart the Training Job

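After updating the configuration, restart your Horovod training job. Ensure that the environment variable is set in the environment of every worker process, not only in the shell where you launched the job, and that the number of processes started on each node does not exceed the number of GPUs visible on that node.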
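One way to confirm that the setting reached every worker is to log it at the start of the training script. The sketch below is a hypothetical helper, `describe_gpu_env`, that summarizes the relevant variables. It assumes the job is launched through Open MPI, which sets OMPI_COMM_WORLD_LOCAL_RANK for each process; other launchers expose the local rank under different names.

```python
import os
from typing import Mapping, Optional


def describe_gpu_env(environ: Optional[Mapping[str, str]] = None) -> str:
    """Summarize the GPU-related environment seen by this process."""
    env = os.environ if environ is None else environ
    # When CUDA_VISIBLE_DEVICES is unset, CUDA exposes every GPU.
    visible = env.get("CUDA_VISIBLE_DEVICES", "<unset: all GPUs visible>")
    # Set by Open MPI per process (assumption: the job uses Open MPI).
    local_rank = env.get("OMPI_COMM_WORLD_LOCAL_RANK", "<unset>")
    return f"CUDA_VISIBLE_DEVICES={visible} local_rank={local_rank}"


print(describe_gpu_env())
```

If a worker reports a value that differs from what you exported, the variable was probably not forwarded by the launcher and needs to be passed to the workers explicitly.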

By following these steps, you should be able to resolve the 'invalid device ordinal' error and continue with your distributed training using Horovod.
