On Sep 17, 2013, at 23:19, Ralph Castain <rhc@open-mpi.org> wrote:

I very much doubt that it would work, though I can give it a try, as the patch addresses Intercomm_merge and not Intercomm_create. I debated about putting the patch into "create" instead, but nobody was citing that as being a problem. In my opinion, it makes more sense for it to be in "create", and I can certainly shift it to that location easily enough.

So we converge here. If the problem is correctly addressed at Intercomm_create time, there will be no need to address it in Intercomm_merge, as the only way to get an intercomm where peers don't know each other's modex info is via Intercomm_create. Every other function that creates an inter-communicator does so starting from a common group, so the peers already know each other.

My concern with your approach is that I'm not convinced it will work. The problem is that not all the MPI procs can communicate via MPI at this point because they lack the required info and haven't added the procs into the BTLs yet. So packing modex info into a buffer and attempting to send it via MPI could just cause the lockup to occur sooner.

You will have to believe me on this one, but MPI_Intercomm_create is a one-of-a-kind call and not a very straightforward concept (this is why I suggested reading section 6.6.2). One of the arguments to this function is a bridge communicator, to which both leaders belong. So the two sides are not totally unknown to each other: their leaders know each other, as they already belong to this "bridge" communicator (and obviously each group should know how to communicate with its own leader). My solution was to reduce the modex info of each group onto its leader, let the leaders exchange this "local group modex information", and then broadcast the remote modex info locally.
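To make the three phases concrete, here is a toy, single-process sketch in Python. It is not Open MPI code and makes no MPI calls; the function name exchange_modex, the rank names, and the string "modex blobs" are all hypothetical, chosen only to show which rank learns what at each step.

```python
# Toy model of the gather / leader-exchange / broadcast scheme.
# Assumptions (not from Open MPI): ranks are named by unique strings and
# each rank's modex info is an opaque string.

def exchange_modex(group_a, group_b, leader_a, leader_b):
    """Return, for every rank, the modex info it knows after the exchange.

    group_a / group_b map a rank name to that rank's own modex blob.
    """
    if leader_a not in group_a or leader_b not in group_b:
        raise ValueError("each leader must belong to its own group")
    if set(group_a) & set(group_b):
        raise ValueError("rank names must be unique across both groups")

    # Before the call, every rank knows only the peers of its own group.
    known = {rank: dict(group_a) for rank in group_a}
    known.update({rank: dict(group_b) for rank in group_b})

    # Phase 1: each group gathers its members' modex info on its leader.
    gathered_a = dict(group_a)  # held by leader_a
    gathered_b = dict(group_b)  # held by leader_b

    # Phase 2: the leaders swap the gathered info over the bridge
    # communicator, the one place where they already know each other.
    known[leader_a].update(gathered_b)
    known[leader_b].update(gathered_a)

    # Phase 3: each leader broadcasts the remote info within its own group.
    remote_for_a = {rank: known[leader_a][rank] for rank in group_b}
    remote_for_b = {rank: known[leader_b][rank] for rank in group_a}
    for rank in group_a:
        known[rank].update(remote_for_a)
    for rank in group_b:
        known[rank].update(remote_for_b)
    return known
```

The point of the sketch is that only phase 2 crosses the group boundary, and it does so between two ranks that can already talk to each other, so no rank has to send to a peer it has no modex info for.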

Hence the approach of ensuring all procs have the required info. Not optimal, I agree, but performance isn't an issue with this function, and the trivial amount of RTE effort didn't seem worth worrying about.

My concern is that it forces every other RTE supported by Open MPI to provide functionality that is so MPI-specific that even the MPI libraries have a hard time supporting it.

I have a half-working patch. Don't push the CMR yet; I'll ping you back soon.