I'm concerned about your usage of abort here. Looking at the code, I noticed that you call RTE_ABORT deep inside the BTL stack. This is a significant divergence from our current behavior (except for USNIC, apparently, as the code is now in 1.7). The BTLs are not deciders, merely reporters: any error should be reported upstream and dealt with at that level.
If you want to pursue such a drastic change in the behavior of Open MPI, you should definitely propose it through an RFC.
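To make the "reporters, not deciders" point concrete, here is a minimal sketch of the pattern I have in mind. Every name in it (sketch_btl_t, sketch_btl_register_error, and so on) is hypothetical and chosen for illustration only; this is not the actual BTL interface, and the "reachable" flag simply stands in for whatever the connectivity check would return.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical names throughout: an illustration of the pattern,
 * not the actual Open MPI BTL interface. */

typedef enum {
    SKETCH_SUCCESS     = 0,
    SKETCH_ERR_UNREACH = -1
} sketch_status_t;

/* Callback supplied by the upper layer; the transport only invokes it. */
typedef void (*sketch_error_cb_t)(int peer, sketch_status_t status, void *ctx);

typedef struct {
    sketch_error_cb_t error_cb;   /* NULL until the upper layer registers one */
    void             *error_ctx;  /* opaque pointer handed back to the callback */
} sketch_btl_t;

/* The upper layer registers how it wants to hear about failures. */
void sketch_btl_register_error(sketch_btl_t *btl, sketch_error_cb_t cb, void *ctx)
{
    btl->error_cb  = cb;
    btl->error_ctx = ctx;
}

/* Transport-level send. 'reachable' is a stand-in for the result of a
 * connectivity check. On failure the transport reports and returns an
 * error code; it never aborts the job itself. */
sketch_status_t sketch_btl_send(sketch_btl_t *btl, int peer, int reachable)
{
    if (!reachable) {
        if (btl->error_cb != NULL) {
            btl->error_cb(peer, SKETCH_ERR_UNREACH, btl->error_ctx);
        }
        return SKETCH_ERR_UNREACH;
    }
    return SKETCH_SUCCESS;
}

/* State owned by the upper layer, which is where the policy lives. */
typedef struct {
    int failures;
    int last_failed_peer;
} sketch_upper_state_t;

/* Upper-layer handler: records the failure. Whether to retry, fail over
 * to another transport, or abort is decided here, not in the transport. */
void sketch_upper_on_error(int peer, sketch_status_t status, void *ctx)
{
    sketch_upper_state_t *st = (sketch_upper_state_t *) ctx;
    st->failures++;
    st->last_failed_peer = peer;
    fprintf(stderr, "peer %d unreachable (status %d)\n", peer, (int) status);
}
```

With this shape, the upper layer registers sketch_upper_on_error once, and a failed sketch_btl_send both returns SKETCH_ERR_UNREACH and notifies the handler. If aborting really is the right response, that choice is made in one place upstream rather than buried in the transport.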
On Feb 26, 2014, at 23:21, svn-commit-mailer_at_[hidden] wrote:
> Author: jsquyres (Jeff Squyres)
> Date: 2014-02-26 17:21:25 EST (Wed, 26 Feb 2014)
> New Revision: 30860
> URL: https://svn.open-mpi.org/trac/ompi/changeset/30860
> Add usnic connectivity-checking agent service.
> Basically: since usnic is a connectionless transport, we do not get
> OS-provided services "for free" that connection-oriented transports
> get, namely: "hey, I wasn't able to make a connection to peer X", and
> "hey, your connection to peer X has died."
> This connectivity-checker runs in a separate progress thread in the
> usnic BTL in local rank 0 on each server. Upon first send in any
> process, the connectivity-checker agent will send some UDP pings to the
> peer to ensure that we can reach it. If we can't, we'll abort the job
> with a nice show_help message.
> There's a lengthy comment in btl_usnic_connectivity.h that explains the
> scheme and how it works.
> Reviewed by Dave Goodell.