Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI-1.3.2, openib/iWARP(cxgb3) problem: PML add procs failed (Unreachable)
From: Jon Mason (jon_at_[hidden])
Date: 2009-05-06 15:30:36


On Wed, May 06, 2009 at 01:20:48PM -0400, Ken Cain wrote:
> Thanks Jon. I have some responses inline.
>
> Jon Mason wrote:
>> On Wed, May 06, 2009 at 12:15:19PM -0400, Ken Cain wrote:
>>> I am trying to run NetPIPE-3.7.1 NPmpi using Open MPI version 1.3.2
>>> with the openib btl in an OFED-1.4 environment. The system
>>> environment is two Linux (2.6.27) ppc64 blades, each with one
>>> Chelsio RNIC device, interconnected by a 10GbE switch. The problem
>>> is that I cannot (using Open MPI) establish connections between the
>>> two MPI ranks.
>>>
>>> I have already read the OMPI FAQ entries and searched for similar
>>> problems reported to this email list without success. I do have a
>>> compressed config.log that I can provide separately (it is 80KB in
>>> size so I'll spare everyone here). I also have the output of
>>> ompi_info --all that I can share.
>>>
>>> I can successfully run small diagnostic programs such as rping,
>>> ib_rdma_bw, ib_rdma_lat, etc. between the same two blades. I can also
>>> run NPmpi using another MPI library (MVAPICH2) and the Chelsio/iWARP
>>> interface.
>>>
>>> Here is the one example mpirun command line I used:
>>> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self
>>> --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0 -l 1 -u 1024 >
>>> outfile1 2>&1
>>>
>>> and its output:
>>>> --------------------------------------------------------------------------
>>>> No OpenFabrics connection schemes reported that they were able to be
>>>> used on a specific port. As such, the openib BTL (OpenFabrics
>>>> support) will be disabled for this port.
>>>>
>>>> Local host: aae1
>>>> Local device: cxgb3_0
>>>> CPCs attempted: oob, xoob, rdmacm
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> No OpenFabrics connection schemes reported that they were able to be
>>>> used on a specific port. As such, the openib BTL (OpenFabrics
>>>> support) will be disabled for this port.
>>>>
>>>> Local host: aae4
>>>> Local device: cxgb3_0
>>>> CPCs attempted: oob, xoob, rdmacm
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[3115,1],0]) is on host: aae4
>>>> Process 2 ([[3115,1],1]) is on host: aae1
>>>> BTLs attempted: self
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[3115,1],1]) is on host: aae1
>>>> Process 2 ([[3115,1],0]) is on host: aae4
>>>> BTLs attempted: self
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> PML add procs failed
>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> PML add procs failed
>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** before MPI was initialized
>>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>> *** before MPI was initialized
>>>> [aae1:6598] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>> [aae4:19434] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 0 with PID 19434 on
>>>> node aae4 exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>> --------------------------------------------------------------------------
>>>
>>>
>>> Here is another mpirun command I used (adding verbosity and more
>>> specific btl parameters):
>>>
>>> mpirun --mca orte_base_help_aggregate 0 --mca btl openib,self,sm
>>> --mca btl_base_verbose 10 --mca btl_openib_verbose 10 --mca
>>> btl_openib_if_include cxgb3_0:1 --mca btl_openib_cpc_include rdmacm
>>> --mca btl_openib_device_type iwarp --mca btl_openib_max_btls 1 --mca
>>> mpi_leave_pinned 1 --hostfile ~/1usrv_ompi_machfile -np 2 ./NPmpi -p0
>>> -l 1 -u 1024 > ~/outfile2 2>&1
>>
>> It looks like you are only using 1 port on the Chelsio RNIC. Based on
>> the messages above, it looks like it might be the wrong port. Is there
>> a reason why you are excluding it? Also, you might try the TCP btl and
>> verify that it works correctly in the testcase (as a point of
>> reference).
>>
>> Thanks,
>> Jon
>>
>
> Yes we only have one port connected. The cxgb3 device is associated with
> eth2 and eth3. Only eth2 is configured with a static IP address. To be
> sure I didn't choose the wrong OFED device I tried the same command
> line, changing btl_openib_if_include to cxgb3_0:2 (instead of my
> original choice cxgb3_0:1). I got the same result in this experiment.
> The same result occurs when I ask only for cxgb3_0 (no particular port).
>
> To test with the TCP btl I changed both of my mpirun commands so that I
> added tcp to the --mca btl list (keeping openib and the others). In both
> cases the NPmpi application runs to completion (using a TCP/IP transport
> not iWARP). In the first (simpler) mpirun command I get the same "No
> OpenFabrics connection schemes ..." warning message (followed by
> successful run to completion as noted). In the second mpirun command I
> get no particular warning messages and run to completion.

Hmm... If you are just adding tcp and keeping openib in the btl list,
it is possible that it is using the openib btl with the simpler
command line. Can you try the simpler command line without tcp in the
btl list? Also, please show the exact commands you are running (working
and non-working).
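
When comparing the working and non-working runs, it can help to pull just
the "CPCs attempted" and "BTLs attempted" lines out of each captured output
file. Below is a minimal Python sketch, assuming the help-message format
quoted earlier in this thread; summarize_attempts is a hypothetical helper
name and is not part of Open MPI.

```python
import re
from collections import defaultdict


def summarize_attempts(text):
    """Collect the 'CPCs attempted' and 'BTLs attempted' lists from mpirun output.

    Leading quote markers such as '>>>> ' are stripped first, so text pasted
    from an email works as well as a raw output file.
    """
    summary = defaultdict(set)
    for raw in text.splitlines():
        # Drop email quote markers and leading whitespace.
        line = re.sub(r"^[>\s]+", "", raw)
        match = re.match(r"(CPCs|BTLs) attempted:\s*(.*)", line)
        if match:
            kind, values = match.groups()
            # CPC lists are comma-separated; BTL lists are space-separated.
            summary[kind].update(v for v in re.split(r"[,\s]+", values) if v)
    return {kind: sorted(vals) for kind, vals in summary.items()}


if __name__ == "__main__":
    # Sample lines copied from the first run's output in this thread.
    sample = (
        "Local host: aae1\n"
        "Local device: cxgb3_0\n"
        "CPCs attempted: oob, xoob, rdmacm\n"
        "BTLs attempted: self\n"
    )
    print(summarize_attempts(sample))
```

Running this over each captured file (for example outfile1 and ~/outfile2
from the commands above) gives a quick side-by-side view of which BTLs each
run actually tried.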

Is there a reason why the mpirun command you are using is so complex?

Thanks,
Jon
>
>
>
>>> and its output:
>>>> [aae4:19426] mca: base: components_open: Looking for btl components
>>>> [aae4:19426] mca: base: components_open: opening btl components
>>>> [aae4:19426] mca: base: components_open: found loaded component openib
>>>> [aae4:19426] mca: base: components_open: component openib has no register function
>>>> [aae4:19426] mca: base: components_open: component openib open function successful
>>>> [aae4:19426] mca: base: components_open: found loaded component self
>>>> [aae4:19426] mca: base: components_open: component self has no register function
>>>> [aae4:19426] mca: base: components_open: component self open function successful
>>>> [aae4:19426] mca: base: components_open: found loaded component sm
>>>> [aae4:19426] mca: base: components_open: component sm has no register function
>>>> [aae4:19426] mca: base: components_open: component sm open function successful
>>>> [aae1:06503] mca: base: components_open: Looking for btl components
>>>> [aae1:06503] mca: base: components_open: opening btl components
>>>> [aae1:06503] mca: base: components_open: found loaded component openib
>>>> [aae1:06503] mca: base: components_open: component openib has no register function
>>>> [aae1:06503] mca: base: components_open: component openib open function successful
>>>> [aae1:06503] mca: base: components_open: found loaded component self
>>>> [aae1:06503] mca: base: components_open: component self has no register function
>>>> [aae1:06503] mca: base: components_open: component self open function successful
>>>> [aae1:06503] mca: base: components_open: found loaded component sm
>>>> [aae1:06503] mca: base: components_open: component sm has no register function
>>>> [aae1:06503] mca: base: components_open: component sm open function successful
>>>> [aae4:19426] select: initializing btl component openib
>>>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x1425, part ID 49
>>>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Chelsio T3
>>>> [aae4][[3107,1],0][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
>>>> [aae4][[3107,1],0][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
>>>> [aae4:19426] openib BTL: rdmacm CPC available for use on cxgb3_0
>>>> [aae4:19426] select: init of component openib returned success
>>>> [aae4:19426] select: initializing btl component self
>>>> [aae4:19426] select: init of component self returned success
>>>> [aae4:19426] select: initializing btl component sm
>>>> [aae4:19426] select: init of component sm returned success
>>>> [aae1:06503] select: initializing btl component openib
>>>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x1425, part ID 49
>>>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: Chelsio T3
>>>> [aae1][[3107,1],1][btl_openib_ini.c:166:ompi_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
>>>> [aae1][[3107,1],1][btl_openib_ini.c:185:ompi_btl_openib_ini_query] Found corresponding INI values: default
>>>> [aae1:06503] openib BTL: rdmacm CPC available for use on cxgb3_0
>>>> [aae1:06503] select: init of component openib returned success
>>>> [aae1:06503] select: initializing btl component self
>>>> [aae1:06503] select: init of component self returned success
>>>> [aae1:06503] select: initializing btl component sm
>>>> [aae1:06503] select: init of component sm returned success
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[3107,1],0]) is on host: aae4
>>>> Process 2 ([[3107,1],1]) is on host: aae1
>>>> BTLs attempted: openib self sm
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[3107,1],1]) is on host: aae1
>>>> Process 2 ([[3107,1],0]) is on host: aae4
>>>> BTLs attempted: openib self sm
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> PML add procs failed
>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** before MPI was initialized
>>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> PML add procs failed
>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> *** An error occurred in MPI_Init
>>>> *** before MPI was initialized
>>>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>> [aae1:6503] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>>> [aae4:19426] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 0 with PID 19426 on
>>>> node aae4 exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>> --------------------------------------------------------------------------
>>>
>>>
>>> Thanks for any advice/help you can offer.
>>>
>>>
>>> -Ken
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Thanks,
>
> -Ken