I have a program that uses MPI::Comm::Spawn to start processes on
compute nodes (c0-0, c0-1, etc.). Communication among the compute
nodes consists of Isend and Irecv pairs, while communication between
the head node and the compute nodes consists of gather and bcast
operations.
After about 80 successful loops (gather/bcast pairs), I get this
error message from the head node process during a gather call:

    from headnode.local to: c0-0 error polling HP CQ with status WORK
    REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1

The relevant environment variables:
If rd_low and rd_num are left at their default values, the program
simply hangs in the gather call after about 20 iterations (each
iteration being one gather and one bcast).
Can anyone shed any light on what this error message means or what
might be done about it?