Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] running problem on Dell blade server, confirm 2d21ce3ce8be64d8104b3ad71b8c59e2514a72eb
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-29 07:06:58


On Apr 25, 2009, at 11:59 AM, Anton Starikov wrote:

> I can confirm that I have exactly the same problem, also on a Dell
> system, even with the latest Open MPI.
>
> Our system is:
>
> Dell M905
> OpenSUSE 11.1
> kernel: 2.6.27.21-0.1-default
> ofed-1.4-21.12 from SUSE repositories.
> OpenMPI-1.3.2
>
>
> What I can also add is that it does not only affect Open MPI. If these
> messages are triggered after mpirun:
> [node032][[9340,1],11][btl_openib_component.c:3002:poll_device] error
> polling HP CQ with -2 errno says Success
>
> then the IB stack hangs. You cannot even reload it; you have to reboot
> the node.
>

Open MPI should not be able to cause something that severe.
Specifically, Open MPI should not be able to hang the OFED stack.
Have you run layer 0 diagnostics to confirm that your fabric is clean?
You might want to contact your IB vendor to find out how to do that.

-- 
Jeff Squyres
Cisco Systems