Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OMPI-1.3.2, openib/iWARP(cxgb3) problem: PML add procs failed (Unreachable)
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-07 07:57:47

On May 6, 2009, at 4:45 PM, Ken Cain wrote:

> Is it possible for OMPI to generate output at runtime indicating
> exactly what btl(s) will be used?

At present, we only have a fairly lame system to do this. We wanted
to print out a connection map in v1.3, but it didn't happen -- this
feature has been re-targeted for v1.5.

It's unfortunately a surprisingly complex issue; one reason that it's
"hard" is that OMPI lazily makes connections and supports striping
across multiple networks. Hence, to make a completely accurate map,
OMPI has to guarantee to make *all* network connections and then
gather all the connection information back to MPI_COMM_WORLD rank 0 to
print out.

What OMPI does today is that if you specifically ask for a high-speed
network and we're unable to find one, we'll warn about it (because if
you asked for it, you likely really want to use it -- if there isn't
one, that's likely a problem). So if you:

   mpirun --mca btl openib,sm,self,tcp ...

And OMPI doesn't find any active OpenFabrics ports, it'll print a
warning.
> Removing tcp below brings me back to the original simple command line
> that fails with the output shown above (indicating that openib btl
> will be disabled):
> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self \
>   --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024

It looks like you're having two problems:

1. The RDMACM connector in OMPI decides that it can't be used:

mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self \
  --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile1 2>&1

> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port. As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> Local host: aae1
> Local device: cxgb3_0
> CPCs attempted: oob, xoob, rdmacm

*** Can you re-run this scenario with --mca btl_base_verbose 50? I'd
like to see why the RDMA CM CPC disqualified itself.
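Putting that together with the command from the quoted message, the re-run would look something like this (the output filename here is just illustrative):

```shell
# Re-run the failing case with BTL verbosity turned up so the openib
# BTL logs why each connect pseudo-component (oob, xoob, rdmacm)
# accepts or rejects the cxgb3 port.
mpirun --mca orte_base_help_aggregate 0 \
       --mca btl openib,self \
       --mca btl_base_verbose 50 \
       --hostfile ~/1usrv_ompi_machfile \
       -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile_verbose 2>&1
```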

2. But if you specify the port and restrict OMPI to only the rdmacm
connector (CPC), the RDMA CM CPC *does* become available (which is
just weird -- I don't know why that would be different than the above
case...), but then it decides that it cannot connect:

mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self,sm \
  --mca btl_base_verbose 10 --mca btl_openib_verbose 10 \
  --mca btl_openib_if_include cxgb3_0:1 --mca btl_openib_cpc_include rdmacm \
  --mca btl_openib_device_type iwarp --mca btl_openib_max_btls 1 \
  --mca mpi_leave_pinned 1 --hostfile ~/1usrv_ompi_machfile \
  -np 2 ./NPmpi -p0 -l 1 -u 1024 > ~/outfile2 2>&1

>...lots of output...
> [aae4:19426] openib BTL: rdmacm CPC available for use on cxgb3_0
>...lots of output...
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> Process 1 ([[3107,1],0]) is on host: aae4
> Process 2 ([[3107,1],1]) is on host: aae1
> BTLs attempted: openib self sm
> Your MPI job is now going to abort; sorry.

*** Very strange. Can you send the output of ibv_devinfo -v from both
hosts?
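For reference, that device query can be run on each node like this (hostnames come from the quoted error message; the output filename is just a suggestion):

```shell
# On each of the two nodes (aae4 and aae1), dump full verbose info for
# all RDMA devices. For the Chelsio adapter this includes cxgb3_0's
# port state, transport type (iWARP), and capabilities.
ibv_devinfo -v > ibv_devinfo_$(hostname).txt 2>&1
```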

Jeff Squyres
Cisco Systems