Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Intercomm Merge
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-09-22 19:43:36


On Sep 22, 2013, at 2:15 PM, George Bosilca <bosilca_at_[hidden]> wrote:

> In fact there are only two types of information: one that is added by the OMPI layer, which is exchanged during the modex exchange stage, and whatever else is built on top of this information by different pieces of the software stack (including the RTE). If we mark these two types of data independently, we will be able to exchange only what was registered in the beginning, which is basically what is needed to connect processes together. Everything else should be built on top of this.

Agreed - we just have to figure out how to mark the data. However, there is RTE data sometimes required as well, as indicated below, though this depends on the RTE.
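
As a rough illustration of what "marking" could look like, here is a minimal sketch in C. It assumes a flat in-memory table rather than the real opal_db, and the DB_RTE flag, the db_entry_t struct, and fetch_non_rte() are hypothetical names used only for illustration. The DB_INTERNAL/DB_EXTERNAL names come from this thread, but the values below are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag bits. DB_INTERNAL/DB_EXTERNAL are named in the thread; the values
 * and the DB_RTE bit are assumptions for this sketch. */
enum {
    DB_INTERNAL = 0x01,
    DB_EXTERNAL = 0x02,
    DB_RTE      = 0x04   /* hypothetical: entry was stored by the RTE */
};

/* Hypothetical stand-in for one stored key/value pair. */
typedef struct {
    const char *key;
    const char *value;
    unsigned    flags;
} db_entry_t;

/* Collect pointers to every entry NOT marked DB_RTE.
 * Writes at most max pointers into out and returns how many were written. */
static size_t fetch_non_rte(const db_entry_t *db, size_t n,
                            const db_entry_t **out, size_t max)
{
    size_t count = 0;

    assert(db != NULL && out != NULL);
    for (size_t i = 0; i < n && count < max; i++) {
        if (!(db[i].flags & DB_RTE)) {
            out[count++] = &db[i];
        }
    }
    return count;
}
```

The RTE would OR DB_RTE into the flags when it stores its own entries, and an in-band exchange would ship only what fetch_non_rte() returns.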

>
> This brings me to the second issue, a software layer setting up information for another one. I think we mixed things together by a lack of clear separation between the layers. The RTE should be in charge of setting things up and allowing processes to exchange information, not to babysit the MPI processes and annotate their modex with additional info.

I don't think we do, last I checked. The only RTE-related data in the ORTE-supported modex is that required by the RTE to support the MPI layer - e.g., URI info to support openib connection handshakes, or daemon vpid for locality computations. Otherwise, I'm not aware of anything added just for RTE purposes.

>
> In particular regarding the topology stuff, I don't see any reason not to be able to build the info at the MPI layer. Once we have a daemon name or vpid, it is trivial to figure out if two processes are or are not on the same node. If they are, we can extract their topo information to figure out more precise details (NUMA hierarchy or whatever).

There's the rub - daemon names/vpids are an ORTE concept that isn't shared by all RTEs. How we determine whether we are on two different nodes vs the same node is something done at the RTE level. In addition, some BTLs require RTE support for connection formation, and others don't - depends on the RTE as well as the BTL.
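
To show why the same-node test has to come from the RTE, here is a minimal sketch in C of a locality hook. Everything in it is hypothetical: proc_info_t, the two same_node_* functions, and set_locality() are illustrative names, not existing Open MPI interfaces. The hostname-based variant is just an assumed example of what a non-ORTE RTE might do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-process record; field names are illustrative only. */
typedef struct {
    uint32_t    daemon_vpid;  /* meaningful only under an ORTE-like RTE */
    const char *hostname;     /* what some other RTE might expose instead */
    bool        is_local;     /* result: peer shares our node */
} proc_info_t;

/* Each RTE supplies its own definition of "same node". */
typedef bool (*same_node_fn)(const proc_info_t *me, const proc_info_t *peer);

/* ORTE-style test: same daemon means same node. */
static bool same_node_by_daemon(const proc_info_t *me, const proc_info_t *peer)
{
    return me->daemon_vpid == peer->daemon_vpid;
}

/* Assumed alternative for an RTE that has no daemon vpids. */
static bool same_node_by_hostname(const proc_info_t *me, const proc_info_t *peer)
{
    return me->hostname != NULL && peer->hostname != NULL &&
           strcmp(me->hostname, peer->hostname) == 0;
}

/* The MPI layer calls through whichever test the RTE registered and
 * never looks at RTE-specific fields itself. */
static void set_locality(same_node_fn rte_same_node,
                         const proc_info_t *me, proc_info_t *peer)
{
    assert(rte_same_node != NULL && me != NULL && peer != NULL);
    peer->is_local = rte_same_node(me, peer);
}
```

With this shape the MPI layer only ever calls set_locality(); which fields get compared stays inside the RTE.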

>
> This is something that is also hurting my effort toward moving the BTL into OPAL. I had to build a complex infrastructure to duplicate the connection information to be able to hide it from the RTE. Maybe it's time we address this problem in a more consistent way.

I don't understand why that would be necessary - eventually, the RTE is going to want to know that info anyway as it intends to use the BTLs as well. Why not just put it in the opal db?

>
> George.
>
>
> On Sep 19, 2013, at 11:08 , Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Been wracking my brain on this, and I can't find any way to do this cleanly without invoking some kind of extension/modification to the MPI-RTE interface.
>>
>> The problem is that we are now executing an "in-band" modex operation. This is fine, but the modex operation (no matter how it is executed) is an RTE-dependent operation. Our current ompi_rte_modex function automatically performs it out-of-band, so we don't want to use it here. However, we currently lack any interface for directly obtaining endpoint info and/or for defining/setting locality.
>>
>> There are several ways we could resolve the endpoint problem:
>>
>> * define flags as I mentioned previously and modify the opal_db APIs to indicate "we want only non-RTE data"
>>
>> * set a convention that all OMPI-level data begin with a known substring like "ompi." - we could then simply call "fetch" with an "ompi.*" wildcard to retrieve all MPI-related data
>>
>> * modify the ompi_modex_* routines to insert "ompi." at the beginning of all keys - this would require an asprintf call, which means a malloc
>>
>> * add new functions "ompi_rte_get_endpoint_info" and "ompi_rte_set_endpoint_info", and let the RTEs figure out how to get/set the right data
>>
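
For the "ompi." prefix options in the list above, here is a minimal sketch in C. It swaps the asprintf call for snprintf into a caller-supplied buffer, so no malloc is needed; that is an alternative for illustration, not what the ompi_modex_* routines do. OMPI_KEY_PREFIX, make_ompi_key(), and is_ompi_key() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define OMPI_KEY_PREFIX "ompi."   /* convention proposed in the thread */

/* Build "ompi.<key>" in a caller-supplied buffer (no heap allocation).
 * Returns false if the result would not fit, including the terminator. */
static bool make_ompi_key(char *buf, size_t buflen, const char *key)
{
    int n;

    assert(buf != NULL && key != NULL);
    n = snprintf(buf, buflen, "%s%s", OMPI_KEY_PREFIX, key);
    return n >= 0 && (size_t)n < buflen;
}

/* An "ompi.*" wildcard fetch reduces to a prefix comparison per key. */
static bool is_ompi_key(const char *key)
{
    assert(key != NULL);
    return strncmp(key, OMPI_KEY_PREFIX, strlen(OMPI_KEY_PREFIX)) == 0;
}
```

The trade-off is a fixed upper bound on key length in exchange for skipping the allocation on every store.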
>>
>> The locality issue is a little tougher. I can't think of any RTE-agnostic method for setting locality. Unless someone else can, the only option I can propose is to add a new MPI-RTE interface "ompi_rte_set_locality(proc)".
>>
>> Thoughts?
>> Ralph
>>
>>
>> On Sep 18, 2013, at 10:18 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Actually, we wouldn't have to modify the interface - just have to define a DB_RTE flag and OR it to the DB_INTERNAL/DB_EXTERNAL one. We'd need to modify the "fetch" routines to pass the flag into them so we fetched the right things, but that's a simple change.
>>>
>>> On Sep 18, 2013, at 10:12 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> I struggled with that myself when doing my earlier patch - part of the reason why I added the dpm API.
>>>>
>>>> I don't know how to update the locality without referencing RTE-specific keys, so maybe the best thing would be to provide some kind of hook into the db that says we want all the non-RTE keys? Would be simple to add that capability, though we'd have to modify the interface so we specify "RTE key" when doing the initial store.
>>>>
>>>> The "internal" flag is used to avoid re-sending data to the system under PMI. We "store" our data as "external" in the PMI components so the data gets pushed out, then fetch using PMI and store "internal" to put it in our internal hash. So "internal" doesn't mean "non-RTE".
>>>>
>>>>
>>>> On Sep 18, 2013, at 10:02 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>
>>>>> I hit send too early.
>>>>>
>>>>> Now that we move the entire "local" modex, is there any way to trim it down or to replace the entries that are not correct anymore? Like the locality?
>>>>>
>>>>> George.
>>>>>
>>>>> On Sep 18, 2013, at 18:53 , George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>
>>>>>> Regarding your comment on the bug trac, I noticed there is a DB_INTERNAL flag. While I see how to set it, I could not figure out any way to get it back.
>>>>>>
>>>>>> With the required modification of the DB API can't we take advantage of it?
>>>>>>
>>>>>> George.
>>>>>>
>>>>>>
>>>>>> On Sep 18, 2013, at 18:52 , Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>
>>>>>>> Thanks George - much appreciated
>>>>>>>
>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>
>>>>>>>> George.
>>>>>>>>
>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> Hangs with any np > 1
>>>>>>>>>
>>>>>>>>> However, I'm not sure if that's an issue with the test vs the underlying implementation
>>>>>>>>>
>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>
>>>>>>>>>> Sent from my phone. No type good.
>>>>>>>>>>
>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1
>>>>>>>>>>>
>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>>>>>>
>>>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an xterm based spawn and the debugging. It can't work without xterm support. Instead try using the test case from the trunk, the one committed by Ralph.
>>>>>>>>>>>>
>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>>>
>>>>>>>>>>>> -----
>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> [hang]
>>>>>>>>>>>> -----
>>>>>>>>>>>>
>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>
>>>>>>>>>>>> -----
>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>> [hang]
>>>>>>>>>>>> -----
>>>>>>>>>>>>
>>>>>>>>>>>>> George.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached test case hangs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of 29166.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>> jsquyres_at_[hidden]
>>>>>>>>>>>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel