Horovod fails with 'protocol error'

Understanding Horovod

Horovod is an open-source distributed deep learning framework developed by Uber. It is designed to make distributed deep learning fast and easy to use. Horovod achieves this by leveraging the Message Passing Interface (MPI) and NVIDIA NCCL libraries to optimize communication between nodes in a cluster, allowing for efficient scaling of training processes.

Identifying the Symptom

When using Horovod, you might encounter an error message that reads: protocol error. This error typically appears during the initialization phase of the Horovod processes and can prevent the distributed training from starting successfully.

Explaining the Issue

What Causes a Protocol Error?

The 'protocol error' in Horovod usually indicates a mismatch in the communication protocol being used by the different processes involved in the distributed training. This mismatch can occur if the processes are not configured to use the same protocol, leading to communication failures.

Common Scenarios

This issue often arises when there are inconsistencies in the environment setup across different nodes, such as different versions of MPI or NCCL, or when the environment variables are not correctly set.

Steps to Resolve the Protocol Error

Step 1: Verify Environment Consistency

Ensure that all nodes in your cluster have the same versions of Horovod, MPI, and NCCL installed. You can check the versions by running:

mpirun --version
horovodrun --version

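Make sure these commands return the same version numbers on all nodes. They report only the MPI and Horovod versions, so check the installed NCCL version separately on each node.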
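The comparison in this step can be scripted. The following is a minimal sketch, not an official Horovod tool. It assumes passwordless SSH to each node. The host names in HOSTS and the helper names (collect_version, group_by_version, check_cluster) are hypothetical. It runs the two version commands on every host and groups the hosts by the output they return.

```python
import subprocess
from collections import defaultdict

# Hypothetical host names; replace with the nodes in your cluster.
HOSTS = ["host1", "host2"]
# The version commands from this step.
COMMANDS = ["mpirun --version", "horovodrun --version"]


def collect_version(host, command, timeout=30):
    """Run `command` on `host` over SSH and return its combined, stripped output."""
    result = subprocess.run(
        ["ssh", host, command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return (result.stdout + result.stderr).strip() or "<no output>"


def group_by_version(versions):
    """Map each distinct output string to the sorted list of hosts that reported it."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def check_cluster(hosts=HOSTS, commands=COMMANDS):
    """Print any mismatch and return True only if every command agrees on all hosts."""
    consistent = True
    for command in commands:
        versions = {host: collect_version(host, command) for host in hosts}
        groups = group_by_version(versions)
        if len(groups) > 1:
            consistent = False
            print(f"Mismatch for '{command}':")
            for version, members in groups.items():
                print(f"  {', '.join(members)}: {version!r}")
    return consistent
```

Calling check_cluster() from the node you launch jobs on prints one group per distinct output. A non-interactive SSH session may not load the same PATH or virtual environment as your login shell, so a mismatch reported here is a prompt to inspect the node by hand, not proof of a wrong installation.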

Step 2: Set Communication Protocol

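Make the communication settings identical for every process. One setting that can help is the HOROVOD_MPI_THREADS_DISABLE environment variable: setting it to 1 disables Horovod's use of MPI multi-threading, which can sometimes resolve protocol mismatches. The variable must be present in the environment of every process on every node, not only in the shell that launches the job: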

export HOROVOD_MPI_THREADS_DISABLE=1
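Whether the variable actually reaches every worker can be checked from inside the training script. The sketch below is illustrative only: env_snapshot, format_snapshot, and the WATCHED tuple are hypothetical names, and the idea is simply to log the value each process sees so the lines can be compared across nodes.

```python
import os
import socket

# Variables to report; extend this tuple with any other settings you rely on.
WATCHED = ("HOROVOD_MPI_THREADS_DISABLE",)


def env_snapshot(names=WATCHED, environ=None):
    """Return the value of each watched variable, or '<unset>' if it is missing."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, "<unset>") for name in names}


def format_snapshot(snapshot, host=None, pid=None):
    """Format a snapshot as one log line tagged with the host name and process id."""
    host = socket.gethostname() if host is None else host
    pid = os.getpid() if pid is None else pid
    settings = " ".join(f"{name}={value}" for name, value in sorted(snapshot.items()))
    return f"[{host} pid={pid}] {settings}"
```

Adding print(format_snapshot(env_snapshot())) near the top of train.py produces one line per process. If any line shows <unset> or a different value, the variable is not being propagated to that node.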

Step 3: Check Network Configuration

Ensure that all nodes can communicate with each other over the network. Check firewall settings and ensure that the necessary ports are open for communication. You can test basic reachability (ping does not tell you whether a specific port is open) using:

ping <node-ip>
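Because ping only confirms that a node answers ICMP, a TCP-level check is a useful complement when a firewall is suspected. The following is a minimal sketch using Python's standard socket module. The function name can_connect is hypothetical, and the ports worth testing depend on your setup (port 22 for SSH is one example, assuming the default SSH port is in use).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, can_connect("host2", 22) run from host1 indicates whether host1 can open an SSH connection to host2. A False result means the connection attempt failed, but it does not say whether a firewall, a stopped service, or a DNS problem is responsible.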

Step 4: Use Consistent Launch Commands

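When launching your Horovod job, make sure every process runs the same script with the same arguments. With horovodrun, you issue a single command from one node, and it starts the processes on the listed hosts. A typical command might look like: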

horovodrun -np 4 -H host1:2,host2:2 python train.py

Ensure that the host list and process count are consistent with your cluster setup. In this example, -np 4 equals the sum of the slots given for host1 and host2 (2 + 2).
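The relationship between the process count and the host list can be checked before launching. The sketch below is an illustration and does not reproduce Horovod's own argument parsing. The names parse_hosts and check_launch are hypothetical, and treating an entry without a slot count as one slot is an assumption made here.

```python
def parse_hosts(spec):
    """Parse a host list such as 'host1:2,host2:2' into {'host1': 2, 'host2': 2}."""
    slots = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, count = entry.partition(":")
        # Assumption: an entry without ':slots' counts as a single slot.
        slots[host] = int(count) if count else 1
    return slots


def check_launch(num_proc, spec):
    """Return (ok, message) describing whether num_proc fits the slots in spec."""
    total = sum(parse_hosts(spec).values())
    if num_proc > total:
        return False, f"-np {num_proc} exceeds the {total} slots in -H {spec}"
    if num_proc < total:
        return True, f"-np {num_proc} uses only part of the {total} slots in -H {spec}"
    return True, f"-np {num_proc} matches the {total} slots in -H {spec}"
```

For the command above, check_launch(4, "host1:2,host2:2") reports a match, while a process count of 5 against the same host list is flagged as exceeding the available slots.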
