After discussion on the telecon, we decided to:
1. let the modex be non-blocking so we can fall thru - only when the corresponding MCA param is set!
2. do not modify the modex_recv to add the callback as the MPI layer really doesn't know how to handle this in an async fashion. Modifying that behavior would be difficult and could wind up impacting the critical path - something we may decide to look into more at a later time
So we will block in a call to modex_recv until the info for that particular process can be obtained. I'll add a timeout feature (via yet another MCA param) so we can gracefully recover if the remote proc never answers for some reason.
Will provide an update when this is ready
What I want to do is make the current "modex" become a no-op, which means we have a lazy modex. As I noted in my commit message, this scales horribly if we don't, hence the MCA param requirement so people don't do this unless their BTL/MTLs will support it.
What I found when playing with that arrangement is that a BTL/MTL is going to need or want data at first message, but that data may not be available yet because that particular remote proc hasn't registered all of its modex data yet. A beautiful race condition. So I was forced to block everyone at "modex" just to ensure the remote data was available at time of request.
If I remove the global "barrier" requirement, then I didn't want to "block" on modex_recv as this is done on a per-proc basis. Even though one proc isn't ready to return the data, another might be - and so I'd let you queue up as many modex_recv calls as you like, resolving each of them as data becomes available. This leaves the MPI layer free to send a message as soon as the target remote proc is ready, without waiting for some other proc to register its modex info.