On Thu, Jun 25, 2009 at 10:29:39AM -0700, D'Auria, Raffaella wrote:
> Dear All,
> I have been encountering a fatal type "error polling LP CQ with status
> RETRY EXCEEDED ERROR status number 12" whenever I try to run a simple
> MPI code (see below) that performs an AlltoAll call.
> We are running the OpenMPI 1.3.2 stack on top of the OFED 1.4.1 stack.
> Our cluster is composed of mostly Mellanox HCAs (MT_03B0140001) and
> some Qlogic (InfiniPath_QLE724) cards.
> The problem manifests itself as soon as the size of the vector, which
> components are being swapped between processes with the all to all
> call, is equal or larger than 68MB.
> Please note that I have this problem only when at least one of the
> computational nodes in the host list of mpiexec is a node with the
> qlogic card InfiniPath_QLE724.
Look at the btl flags (the --mca btl option to mpirun).
It is possible that the InfiniPath_QLE7240 fast transport path for MPI is not
connecting to the Mellanox HCA. The default fast path for cards
like the QLE7240 uses the PSM library, which the Mellanox HCAs do not support.
The mpirun man page hints at this but does not explain what a btl is
or how to explore the Modular Component Architecture (MCA).
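
As a rough sketch of what I mean, something like the following lets you
see which btl components your build has, and then forces every node onto
the verbs (openib) btl so the QLogic and Mellanox nodes use the same
transport. I have not tested this on your cluster, so whether it cures
the RETRY EXCEEDED error is a guess. The hostfile ./hosts, the binary
./alltoall_test and the process count are placeholders, not your names.

```shell
#!/bin/sh
# Sketch only: paths, process count and binary name are hypothetical.

# Select the ob1 PML (the one that drives btl components) and restrict
# the btl list to verbs (openib), shared memory (sm) and loopback (self).
# This keeps Open MPI away from the PSM path on the QLogic nodes.
MCA_ARGS="--mca pml ob1 --mca btl openib,sm,self"

# If Open MPI is on the PATH, list the btl components and their
# parameters so you can see what is available and what the defaults are.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --param btl all
    ompi_info --param btl openib
fi

# Print the command rather than running it, so you can adjust it first.
echo "mpirun $MCA_ARGS -np 4 --hostfile ./hosts ./alltoall_test"
```

If the job runs cleanly this way, that points at the PSM versus verbs
mismatch rather than at the fabric itself.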
T o m M i t c h e l l
Found me a new hat, now what?