
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] PML-bfo deadlocks for message size > eager limit after connection loss
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2014-07-24 10:01:13


My guess is that no one is testing the bfo PML. However, I would have expected it to still work with Open MPI 1.6.5. From your description, it works for smaller messages but fails with larger ones? So, if you just send smaller messages and pull the cable, things work correctly?

One idea is to reduce the output you are getting so you can focus on just the failover information. There is no need for any ORTE debug information as that is not involved in the failover. I would go with these:

mpirun -np 2 --hostfile /opt/ddt/nodes --pernode --mca pml bfo --mca btl self,sm,openib --mca btl_openib_port_error_failover 1 --mca btl_openib_verbose_failover 100 --mca pml_bfo_verbose 100

You can drop this: --mca btl_openib_failover_enabled 1 (that is on by default)
 
In terms of where you can debug, most of the failover support code is in two files.
ompi/mca/pml/bfo/pml_bfo_failover.c
ompi/mca/btl/openib/btl_openib_failover.c

There is also a README here:
ompi/mca/pml/bfo/README

You could also try running without eager RDMA enabled: --mca btl_openib_use_eager_rdma 0
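One more thing worth keeping in mind while you wait for the failover to kick in: the RETRY EXCEEDED errors in your log only fire after the full InfiniBand retry budget is spent. A quick sketch of that budget, using the formula and defaults quoted in the error text (this assumes you have not changed btl_openib_ib_timeout or btl_openib_ib_retry_count from their defaults of 20 and 7):

```python
# Per-attempt local ACK timeout, per the formula in the openib help text:
#   4.096 microseconds * (2^btl_openib_ib_timeout)
def ack_timeout_seconds(ib_timeout=20):
    return 4.096e-6 * (2 ** ib_timeout)

per_attempt = ack_timeout_seconds()          # ~4.29 s with the default timeout of 20
worst_case = per_attempt * 7                 # retry_count (default 7) attempts before
                                             # RETRY EXCEEDED is reported
print(f"per attempt: {per_attempt:.2f} s, worst case: {worst_case:.1f} s")
```

So with defaults you can wait on the order of 30 seconds after pulling the cable before the error surfaces and failover can start; a hang shorter than that may just be the retry timer still running, while a true deadlock persists well beyond it.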

Rolf

>-----Original Message-----
>From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Christoph
>Niethammer
>Sent: Thursday, July 24, 2014 7:54 AM
>To: Open MPI Developers
>Subject: [OMPI devel] PML-bfo deadlocks for message size > eager limit after
>connection loss
>
>Hello,
>
>Is there anybody using/testing the bfo PML - especially with messages > eager
>limit?
>
>Tests using messages > eager limit with the bfo PML seem to deadlock in
>Open MPI 1.6.5 as soon as one of two infiniband connections gets lost (tested
>by disconnecting wire).
>I did not have an opportunity to test 1.8/trunk up to now.
>
>Tests were executed with the following mpirun options:
>
>mpirun -np 2 --hostfile /opt/ddt/nodes --pernode --mca pml bfo --mca
>btl_base_exclude tcp --mca pml bfo --mca btl_openib_port_error_failover 1 -
>-mca btl_openib_failover_enabled 1 --mca btl_openib_port_error_failover 1 -
>-verbose --mca oob_tcp_verbose 100 --mca btl_openib_verbose_failover 100
>--mca btl_openib_verbose 100 --mca btl_base_verbose 100 --mca
>bml_base_verbose 100 --mca pml_bfo_verbose 100 --mca pml_base_verbose
>100 --mca opal_progress_debug 100 --mca orte_debug_verbose 100 --mca
>pml_v_verbose 100 --mca orte_base_help_aggregate 0
>
>Some log output is attached below.
>
>I would appreciate any feedback concerning current status of the bfo PML as
>well as ideas how to debug and where to search for the problem inside the
>Open MPI code base.
>
>
>Best regards
>Christoph Niethammer
>
>--
>
>Christoph Niethammer
>High Performance Computing Center Stuttgart (HLRS) Nobelstrasse 19
>70569 Stuttgart
>
>Tel: ++49(0)711-685-87203
>email: niethammer_at_[hidden]
>http://www.hlrs.de/people/niethammer
>
>
>
>
>
>
>[vm2:21970] defining message event: iof_hnp_receive.c 227 [vm1:16449]
>Rank 0 receiving ...
>[vm2:21970] [[22205,0],0] got show_help from [[22205,1],0]
>--------------------------------------------------------------------------
>The OpenFabrics stack has reported a network error event. Open MPI will try
>to continue, but your job may end up failing.
>
> Local host: vm1
> MPI process PID: 16449
> Error number: 10 (IBV_EVENT_PORT_ERR)
>
>This error may indicate connectivity problems within the fabric; please contact
>your system administrator.
>--------------------------------------------------------------------------
>[vm1][[22205,1],0][btl_openib.c:1350:mca_btl_openib_prepare_dst] frag-
>>sg_entry.lkey = 1829372025 .addr = 1e1bee0 frag-
>>segment.seg_key.key32[0] = 1829372025
>[vm1][[22205,1],0][btl_openib.c:1350:mca_btl_openib_prepare_dst] frag-
>>sg_entry.lkey = 1829372025 .addr = 1e28230 frag-
>>segment.seg_key.key32[0] = 1829372025 [vm2:21970] defining message
>event: iof_hnp_receive.c 227 [vm1:16449] Bandwidth [MB/s]: 594.353640
>[vm1:16449] Rank 0: loop: 1100 [vm1:16449] Rank 0 sending ...
>[vm2:21970] defining message event: iof_hnp_receive.c 227 [vm2:21970]
>defining message event: iof_hnp_receive.c 227
>[vm1][[22205,1],0][btl_openib_failover.c:696:mca_btl_openib_endpoint_noti
>fy] [vm1:16449] BTL openib error: rank=0 mapping out lid=2:name=mthca0 to
>rank=1 on node=vm2 [vm1:16449] IB: Finished checking for pending_frags,
>total moved=0 [vm1:16449] IB: Finished checking for pending_frags, total
>moved=0 Error sending BROKEN CONNECTION buffer (Success)
>[[22205,1],1][btl_openib_component.c:3496:handle_wc] from vm2 to: 192
>error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
>wr_id bdba80 opcode 1 vendor error 129 qp_idx 0 [vm2:21970] [[22205,0],0]
>got show_help from [[22205,1],1]
>--------------------------------------------------------------------------
>The InfiniBand retry count between two MPI processes has been exceeded.
>"Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):
>
> The total number of times that the sender wishes the receiver to
> retry timeout, packet sequence, etc. errors before posting a
> completion error.
>
>This error typically means that there is something awry within the InfiniBand
>fabric itself. You should note the hosts on which this error has occurred; it has
>been observed that rebooting or removing a particular host from the job can
>sometimes resolve this issue.
>
>Two MCA parameters can be used to control Open MPI's behavior with
>respect to the retry count:
>
>* btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
>* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 20). The actual timeout value used is calculated as:
>
> 4.096 microseconds * (2^btl_openib_ib_timeout)
>
> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>
>Below is some information about the host that raised the error and the peer
>to which it was connected:
>
> Local host: vm2
> Local device: mthca0
> Peer host: 192
>
>You may need to consult with your system administrator to get this problem
>fixed.
>--------------------------------------------------------------------------
>[vm2:21982] MCA_BTL_OPENIG_FRAG=5, dropping since connection is
>broken (des=bdba80)
>[[22205,1],1][btl_openib_component.c:3496:handle_wc] from vm2 to: 192
>error polling HP CQ with status WORK REQUEST FLUSHED ERROR status
>number 5 for wr_id c56380 opcode 1 vendor error 244 qp_idx 0 [vm2:21982]
>MCA_BTL_OPENIG_FRAG=0, dropping since connection is broken
>(des=c56380) [vm2:21982] MCA_BTL_OPENIG_FRAG=0, dropping since
>connection is broken (des=c56200) [vm2:21982] MCA_BTL_OPENIG_FRAG=0,
>dropping since connection is broken (des=c56080) [vm2:21982]
>MCA_BTL_OPENIG_FRAG=0, dropping since connection is broken
>(des=c55f00)
>
>...
>
>[vm2:21982] MCA_BTL_OPENIG_FRAG=0, dropping since connection is
>broken (des=c74a00) [vm2:21970] defining message event: iof_hnp_receive.c
>227 [[22205,1],0][btl_openib_component.c:3496:handle_wc] from vm1 to:
>vm2 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12
>for wr_id 1dbe980 opcode 0 vendor error 129 qp_idx 0 [vm2:21970]
>[[22205,0],0] got show_help from [[22205,1],0]
>--------------------------------------------------------------------------
>The InfiniBand retry count between two MPI processes has been exceeded.
>"Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):
>
> The total number of times that the sender wishes the receiver to
> retry timeout, packet sequence, etc. errors before posting a
> completion error.
>
>This error typically means that there is something awry within the InfiniBand
>fabric itself. You should note the hosts on which this error has occurred; it has
>been observed that rebooting or removing a particular host from the job can
>sometimes resolve this issue.
>
>Two MCA parameters can be used to control Open MPI's behavior with
>respect to the retry count:
>
>* btl_openib_ib_retry_count - The number of times the sender will
> attempt to retry (defaulted to 7, the maximum value).
>* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
> to 20). The actual timeout value used is calculated as:
>
> 4.096 microseconds * (2^btl_openib_ib_timeout)
>
> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>
>Below is some information about the host that raised the error and the peer
>to which it was connected:
>
> Local host: vm1
> Local device: mthca0
> Peer host: vm2
>
>You may need to consult with your system administrator to get this problem
>fixed.
>--------------------------------------------------------------------------
>[vm2:21970] defining message event: iof_hnp_receive.c 227 [vm2:21970]
>defining message event: iof_hnp_receive.c 227 [vm2:21970] defining message
>event: iof_hnp_receive.c 227 [vm1:16449] MCA_BTL_OPENIG_FRAG=5,
>dropping since connection is broken (des=1dbe980) [vm1:16449]
>MCA_BTL_OPENIG_FRAG=0, dropping since connection is broken
>(des=1e39880) _______________________________________________
>devel mailing list
>devel_at_[hidden]
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: http://www.open-
>mpi.org/community/lists/devel/2014/07/15243.php