Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] bizarre failure with IMB/openib
From: Peter Kjellström (cap_at_[hidden])
Date: 2011-03-21 10:04:39


On Monday, March 21, 2011 12:25:37 pm Dave Love wrote:
> I'm trying to test some new nodes with ConnectX adaptors, and failing to
> get (so far just) IMB to run on them.
...
> I'm using gcc-compiled OMPI 1.4.3 and the current RedHat 5 OFED with IMB
> 3.2.2, specifying `btl openib,sm,self' (or `mtl psm' on the Qlogic
> nodes). I'm not sure what else might be relevant. The output from
> trying to run IMB follows, for what it's worth.
>
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for MPI
> communications. This means that no Open MPI device has indicated that it
> can be used to communicate between these processes. This is an error;
> Open MPI requires that all MPI processes be able to reach each other.
> This error can sometimes be the result of forgetting to specify the "self"
> BTL.
>
> Process 1 ([[25307,1],2]) is on host: lvgig116
> Process 2 ([[25307,1],12]) is on host: lvgig117
> BTLs attempted: self sm

Only self and sm were attempted, so openib was never even tried. Are you sure
you launched it correctly, and that you have (re)built Open MPI against your
Red Hat 5 IB stack?
 
> Your MPI job is now going to abort; sorry.
...
> [lvgig116:07931] 19 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc [lvgig116:07931] Set MCA parameter

It seems to me that Open MPI gave up because it did not succeed in initializing
any inter-node BTL/MTL.
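
As a rough illustration of that reading of the error, here is a minimal Python
sketch that pulls the "BTLs attempted:" line out of the output and checks
whether anything other than the node-local transports was tried. The function
names and the LOCAL_ONLY_BTLS set are my own, not part of Open MPI, and the
log text is just the lines quoted above.

```python
import re

# Assumption: "self" and "sm" are the only node-local BTLs of interest here.
LOCAL_ONLY_BTLS = {"self", "sm"}


def attempted_btls(log_text):
    """Return the BTL names from the first 'BTLs attempted:' line, if any."""
    match = re.search(r"BTLs attempted:\s*(.*)", log_text)
    return match.group(1).split() if match else []


def has_inter_node_btl(btls):
    """True if at least one attempted BTL is not a node-local transport."""
    return any(name not in LOCAL_ONLY_BTLS for name in btls)


# The relevant lines from the quoted IMB output.
log = """\
> Process 1 ([[25307,1],2]) is on host: lvgig116
> Process 2 ([[25307,1],12]) is on host: lvgig117
> BTLs attempted: self sm
"""

btls = attempted_btls(log)
print("attempted:", btls)
print("inter-node transport tried:", has_inter_node_btl(btls))
```

With only self and sm in the list, the second check comes out false, which is
consistent with two processes on different hosts being unable to reach each
other.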

I'd suggest you try (roughly in order):

 1) run ibstat on all nodes to verify that your IB interfaces are up
 2) run a verbs-level test (such as ib_write_bw) to verify that data can flow
 3) make sure your Open MPI was built with the Red Hat libibverbs-devel present
    (so that a suitable openib BTL gets built).
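
For the third check, one way to see which BTL components an installation
actually contains is to look at what ompi_info reports. The Python sketch
below is only that, a sketch: it assumes ompi_info is on the PATH and prints
one "MCA btl: <name> (...)" line per component, and the sample text is an
illustrative layout rather than output from the nodes discussed here. The
helper names are hypothetical.

```python
import re
import subprocess


def btl_components(ompi_info_text):
    """Collect component names from lines of the form 'MCA btl: <name> (...)'."""
    return re.findall(r"^\s*MCA btl:\s*(\S+)", ompi_info_text, flags=re.MULTILINE)


def installed_btls():
    """Run ompi_info (assumed to be on PATH) and return the BTLs it lists."""
    result = subprocess.run(
        ["ompi_info"], capture_output=True, text=True, check=True
    )
    return btl_components(result.stdout)


# Illustrative text in the assumed layout; not real output from these nodes.
sample = """\
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA mtl: psm (MCA v2.0, API v2.0, Component v1.4.3)
"""

found = btl_components(sample)
print("BTLs listed:", found)
print("openib present:", "openib" in found)
```

On a real node you would call installed_btls() instead of parsing the sample.
If openib is missing from the list, the build most likely did not find the
verbs headers and needs to be redone.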

/Peter

> "orte_base_help_aggregate" to 0 to see all help / error messages
> [lvgig116:07931] 19 more processes have sent help message help-mpi-runtime
> / mpi_init:startup:internal-failure