Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI-1.3.2, openib/iWARP(cxgb3) problem: PML add procs failed (Unreachable)
From: Jon Mason (jon_at_[hidden])
Date: 2009-05-06 12:22:25


On Wed, May 06, 2009 at 12:15:19PM -0400, Ken Cain wrote:
> I am trying to run NetPIPE-3.7.1 NPmpi using Open MPI version 1.3.2 with
> the openib btl in an OFED-1.4 environment. The system environment is two
> Linux (2.6.27) ppc64 blades, each with one Chelsio RNIC device,
> interconnected by a 10GbE switch. The problem is that I cannot (using
> Open MPI) establish connections between the two MPI ranks.
>
> I have already read the OMPI FAQ entries and searched for similar
> problems reported to this email list without success. I do have a
> compressed config.log that I can provide separately (it is 80KB in size
> so I'll spare everyone here). I also have the output of ompi_info --all
> that I can share.
>
> I can successfully run small diagnostic programs such as rping,
> ib_rdma_bw, ib_rdma_lat, etc. between the same two blades. I can also
> run NPmpi using another MPI library (MVAPICH2) and the Chelsio/iWARP
> interface.
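
(For reference, the kind of standalone check being described usually looks
something like the lines below; the address is only a placeholder, not taken
from the original report:

  # on one blade, start the rping server; address is a placeholder
  rping -s -a 192.168.1.4 -v -C 10

  # on the other blade, point the client at the server's address
  rping -c -a 192.168.1.4 -v -C 10

If that already passes, the RDMA CM path itself is likely fine and the
problem is more likely in the BTL/CPC selection inside Open MPI.)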
>
> Here is one example mpirun command line I used:
> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self --hostfile
> ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 > outfile1 2>&1
>
> and its output:
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>> Local host: aae1
>> Local device: cxgb3_0
>> CPCs attempted: oob, xoob, rdmacm
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>> Local host: aae4
>> Local device: cxgb3_0
>> CPCs attempted: oob, xoob, rdmacm
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3115,1],0]) is on host: aae4
>> Process 2 ([[3115,1],1]) is on host: aae1
>> BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3115,1],1]) is on host: aae1
>> Process 2 ([[3115,1],0]) is on host: aae4
>> BTLs attempted: self
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> *** before MPI was initialized
>> [aae1:6598] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [aae4:19434] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 19434 on
>> node aae4 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>
>
>
> Here is another mpirun command I used (adding verbosity and more
> specific btl parameters):
>
> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self,sm --mca
> btl_base_verbose 10 --mca btl_openib_verbose 10 --mca
> btl_openib_if_include cxgb3_0:1 --mca btl_openib_cpc_include rdmacm
> --mca btl_openib_device_type iwarp --mca btl_openib_max_btls 1 --mca
> mpi_leave_pinned 1 --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l
> 1 -u 1024 > ~/outfile2 2>&1

It looks like you are only using one port on the Chelsio RNIC, and based
on the messages above it might be the wrong port. Is there a reason why
you are excluding the other one? Also, you might try the TCP btl and
verify that it works correctly in this testcase (as a point of
reference).
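
As a rough sketch (untested here; it just reuses the hostfile and NetPIPE
arguments from your first command, and "eth0" is a placeholder for whatever
interface name the cxgb3 port shows up as):

  # TCP-only run as a point of reference; interface name is a guess
  mpirun --mca btl tcp,self,sm --mca btl_tcp_if_include eth0 \
      --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024

  # confirm which cxgb3_0 port is actually active/cabled
  ibv_devinfo -d cxgb3_0

If ibv_devinfo shows port 2 as the active one, dropping the ":1" from
btl_openib_if_include (or listing cxgb3_0:2 instead) would be worth a try.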

Thanks,
Jon

>
> and its output:
>> [aae4:19426] mca: base: components_open: Looking for btl components
>> [aae4:19426] mca: base: components_open: opening btl components
>> [aae4:19426] mca: base: components_open: found loaded component openib
>> [aae4:19426] mca: base: components_open: component openib has no register function
>> [aae4:19426] mca: base: components_open: component openib open function successful
>> [aae4:19426] mca: base: components_open: found loaded component self
>> [aae4:19426] mca: base: components_open: component self has no register function
>> [aae4:19426] mca: base: components_open: component self open function successful
>> [aae4:19426] mca: base: components_open: found loaded component sm
>> [aae4:19426] mca: base: components_open: component sm has no register function
>> [aae4:19426] mca: base: components_open: component sm open function successful
>> [aae1:06503] mca: base: components_open: Looking for btl components
>> [aae1:06503] mca: base: components_open: opening btl components
>> [aae1:06503] mca: base: components_open: found loaded component openib
>> [aae1:06503] mca: base: components_open: component openib has no register function
>> [aae1:06503] mca: base: components_open: component openib open function successful
>> [aae1:06503] mca: base: components_open: found loaded component self
>> [aae1:06503] mca: base: components_open: component self has no register function
>> [aae1:06503] mca: base: components_open: component self open function successful
>> [aae1:06503] mca: base: components_open: found loaded component sm
>> [aae1:06503] mca: base: components_open: component sm has no register function
>> [aae1:06503] mca: base: components_open: component sm open function successful
>> [aae4:19426] select: initializing btl component openib
>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x1425, part ID 49
>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Chelsio T3
>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
>> [aae4:19426] openib BTL: rdmacm CPC available for use on cxgb3_0
>> [aae4:19426] select: init of component openib returned success
>> [aae4:19426] select: initializing btl component self
>> [aae4:19426] select: init of component self returned success
>> [aae4:19426] select: initializing btl component sm
>> [aae4:19426] select: init of component sm returned success
>> [aae1:06503] select: initializing btl component openib
>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x1425, part ID 49
>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Chelsio T3
>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
>> [aae1:06503] openib BTL: rdmacm CPC available for use on cxgb3_0
>> [aae1:06503] select: init of component openib returned success
>> [aae1:06503] select: initializing btl component self
>> [aae1:06503] select: init of component self returned success
>> [aae1:06503] select: initializing btl component sm
>> [aae1:06503] select: init of component sm returned success
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3107,1],0]) is on host: aae4
>> Process 2 ([[3107,1],1]) is on host: aae1
>> BTLs attempted: openib self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[3107,1],1]) is on host: aae1
>> Process 2 ([[3107,1],0]) is on host: aae4
>> BTLs attempted: openib self sm
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [aae1:6503] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> [aae4:19426] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 19426 on
>> node aae4 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>
>
>
> Thanks for any advice/help you can offer.
>
>
> -Ken
>
> This message is intended only for the designated recipient(s) and may
> contain confidential or proprietary information of Mercury Computer
> Systems, Inc. This message is solely intended to facilitate business
> discussions and does not constitute an express or implied offer to sell
> or purchase any products, services, or support. Any commitments must be
> made in writing and signed by duly authorized representatives of each
> party.
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users