On Nov 14, 2013, at 3:07 PM, Shamis, Pavel <shamisp_at_[hidden]> wrote:
>> So far as I can tell, the issue is one of blocking. The OOB handshake is now async - i.e., you post a non-blocking recv at the beginning of time, and then do a non-blocking send to the other side when you want to create a connection. The question is: how do you know when that connection is ready?
> As you describe it, the new behavior is identical to the original one. We post a non-blocking (persistent) receive during initialization. Later, OMPI has a barrier in the flow to ensure that all processes have reached that point.
> On the first send, we use a non-blocking OOB send to initialize the connection (QPs). The receive triggers a callback that handles the connection setup. The OOB / XOOB communication semantics are fully non-blocking.
> We don't really block anywhere.
> We use only the ompi_rte_recv_buffer_nb and ompi_rte_send_buffer_nb functions.
The only change is that the receive callback now runs in the ORTE event thread, so perhaps someone needs to look at a way to pass it back into the OMPI event base (which I guess is the OPAL event base)? Just glancing at the code, that looks like it could be the issue, but I honestly have no idea which event base someone would want to switch to, or whether they want to resolve it some other way. There are clearly some things happening in the ofacm oob code that involve thread locking etc., but I don't know what those areas are trying to do.
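
To make the idea of "passing the callback back" concrete, here is a minimal sketch in plain C with pthreads. It is only an illustration under stated assumptions: every name in it (handoff_queue_t, handoff_post, handoff_progress, connection_setup_cb, fake_event_thread) is hypothetical and is not ORTE, OPAL, or ofacm API, and a mutex-protected list stands in for whatever event base would actually be used. The receive-side thread only enqueues the work; the thread that owns the connection state runs it later from its own progress call.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work item: a callback plus its argument. */
typedef void (*handoff_cb_t)(void *arg);

typedef struct handoff_item {
    handoff_cb_t cb;
    void *arg;
    struct handoff_item *next;
} handoff_item_t;

/* Hypothetical queue shared between the event thread and the main thread. */
typedef struct {
    pthread_mutex_t lock;
    handoff_item_t *head, *tail;
} handoff_queue_t;

static handoff_queue_t queue = { PTHREAD_MUTEX_INITIALIZER, NULL, NULL };

/* Called from the event thread: enqueue and return; no connection work here. */
static void handoff_post(handoff_queue_t *q, handoff_item_t *item)
{
    assert(item->cb != NULL);
    item->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL) q->tail->next = item; else q->head = item;
    q->tail = item;
    pthread_mutex_unlock(&q->lock);
}

/* Called from the main thread's progress loop: run everything queued so far.
 * Returns the number of callbacks that were run. */
static int handoff_progress(handoff_queue_t *q)
{
    /* Detach the whole list under the lock, then run callbacks unlocked. */
    pthread_mutex_lock(&q->lock);
    handoff_item_t *item = q->head;
    q->head = q->tail = NULL;
    pthread_mutex_unlock(&q->lock);

    int ran = 0;
    while (item != NULL) {
        handoff_item_t *next = item->next; /* read before the callback runs */
        item->cb(item->arg);
        item = next;
        ran++;
    }
    return ran;
}

/* Stand-in for connection setup: records which thread executed it. */
typedef struct {
    pthread_t ran_on;
    int ran;
} conn_state_t;

static void connection_setup_cb(void *arg)
{
    conn_state_t *s = arg;
    s->ran_on = pthread_self();
    s->ran = 1;
}

static handoff_item_t item;

/* Stand-in for the event thread: its "receive callback" only posts the item. */
static void *fake_event_thread(void *arg)
{
    item.cb = connection_setup_cb;
    item.arg = arg;
    handoff_post(&queue, &item);
    return NULL;
}

/* Returns 1 if the setup callback ran exactly once, on the calling thread. */
static int handoff_demo(void)
{
    conn_state_t state = { .ran = 0 };
    pthread_t tid;

    if (pthread_create(&tid, NULL, fake_event_thread, &state) != 0) return -1;
    pthread_join(tid, NULL);

    int n = handoff_progress(&queue);
    return n == 1 && state.ran && pthread_equal(state.ran_on, pthread_self());
}
```

In this sketch the connection state is only ever touched by the thread that calls handoff_progress, so the callback itself needs no locking; the only shared structure is the queue. Whether that matches what the ofacm oob locking is trying to achieve is an open question.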