Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with MPI_Intercomm_create
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-06-13 08:31:35


George -- can you file a ticket about this?

On Jun 12, 2011, at 1:25 PM, George Bosilca wrote:

> Frédéric,
>
> Based on the current version of the MPI standard, the two groups involved in the intercomm_create have to be disjoint, which means the two leaders cannot be the same process.
>
> Regarding the issue in Open MPI, the problem is deep in our modex exchange (the exchange of contact information). In the example I sent around a while back, the intercomm_create itself succeeds, but the resulting communicator contains processes without this modex information. This leads to an error on the next collective communication.
>
> george.
>
> On Jun 12, 2011, at 03:44 , Frédéric Feyel wrote:
>
>> Dear all, thank you very much for the time spent looking at my problem.
>>
>> After reading your contributions, it's not clear whether there is a bug in
>> Open MPI or not.
>>
>> So I created a small self-contained source code to analyse the behavior,
>> and the problem is still there.
>>
>> I was wondering if the local and remote leaders in the 2 groups could be
>> the same process. Unfortunately, I get
>> an error in both cases (local and remote leaders identical or not).
>>
>> What do you think about my small source code ?
>>
>> Best regards,
>>
>> Frédéric.
>>
>>
>> On Tue, 07 Jun 2011 10:31:51 -0500, Edgar Gabriel <gabriel_at_[hidden]>
>> wrote:
>>> On 6/7/2011 10:23 AM, George Bosilca wrote:
>>>>
>>>> On Jun 7, 2011, at 11:00 , Edgar Gabriel wrote:
>>>>
>>>>> George,
>>>>>
>>>>> I did not look over all the details of your test, but it looks to
>>>>> me like you are violating one of the requirements of
>>>>> intercomm_create, namely the requirement that the two groups have to be
>>>>> disjoint. In your case the parent process(es) are part of both
>>>>> local intra-communicators, aren't they?
>>>>
>>>> The two groups of the two local communicators are disjoint. One
>>>> contains A,B while the other contains only C. The bridge communicator
>>>> contains A,C.
>>>>
>>>> I'm confident my example is supposed to work. At least for Open MPI
>>>> the error is under the hood, as the resulting inter-communicator is
>>>> valid but contains NULL endpoints for the remote process.
>>>
>>> I'll come back to that later, I am not yet convinced that your code is
>>> correct :-) Your local groups might be disjoint, but I am worried about
>>> the ranks of the remote leader in your example. They cannot be 0 from
>>> both groups' perspective.
>>>
>>>>
>>>> Regarding the claim that the two leaders should be separate processes,
>>>> you will not find any wording about this in the current version of
>>>> the standard. In MPI 1.1 there were two contradictory sentences about
>>>> this: one stating that the two groups have to be disjoint, the other
>>>> claiming that the two leaders can be the same process. After
>>>> discussion, the agreement was that the two groups have to be
>>>> disjoint, and the standard has been amended to match the agreement.
>>>
>>>
>>> I realized that this is a non-issue. If the two local groups are
>>> disjoint, there is no way that the two local leaders are the same
>>> process.
>>>
>>> Thanks
>>> Edgar
>>>
>>>>
>>>> george.
>>>>
>>>>
>>>>>
>>>>> I just have MPI-1.1 at hand right now, but here is what it says:
>>>>>
>>>>> ----
>>>>> Overlap of local and remote groups that are bound into an
>>>>> inter-communicator is prohibited. If there is overlap, then the
>>>>> program is erroneous and is likely to deadlock.
>>>>> ----
>>>>>
>>>>> So the bottom line is that the two local intra-communicators that
>>>>> are being used have to be disjoint, and the bridgecomm needs to be
>>>>> a communicator in which at least one process from each of the two
>>>>> disjoint groups is able to talk to the other.
>>>>> Interestingly, I did not find a sentence on whether the two local
>>>>> leaders are allowed to be the same process, or whether they need to be
>>>>> separate processes...
>>>>>
>>>>>
>>>>> Thanks Edgar
>>>>>
>>>>>
>>>>> On 6/7/2011 12:57 AM, George Bosilca wrote:
>>>>>> Frederic,
>>>>>>
>>>>>> Attached you will find an example that is supposed to work. The
>>>>>> main difference from your code is on T3 and T4, where you have
>>>>>> swapped the local and remote comm. As depicted in the picture
>>>>>> attached below, during the 3rd step you create the intercomm
>>>>>> between ab and c (no overlap) using ac as a bridge communicator
>>>>>> (here the two roots, a and c, can exchange messages).
>>>>>>
>>>>>> Based on the MPI 2.2 standard, especially the paragraph quoted in
>>>>>> the PS, the attached code should work. Unfortunately, I could not
>>>>>> run it successfully with either Open MPI trunk or
>>>>>> MPICH2 1.4rc1.
>>>>>>
>>>>>> george.
>>>>>>
>>>>>> PS: Here is what the MPI standard states about
>>>>>> MPI_Intercomm_create:
>>>>>>> The function MPI_INTERCOMM_CREATE can be used to create an
>>>>>>> inter-communicator from two existing intra-communicators, in
>>>>>>> the following situation: At least one selected member from each
>>>>>>> group (the “group leader”) has the ability to communicate with
>>>>>>> the selected member from the other group; that is, a “peer”
>>>>>>> communicator exists to which both leaders belong, and each
>>>>>>> leader knows the rank of the other leader in this peer
>>>>>>> communicator. Furthermore, members of each group know the rank
>>>>>>> of their leader.
>>>>>>
>>>>>>
>>>>>> On Jun 1, 2011, at 05:00 , Frédéric Feyel wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a problem using MPI_Intercomm_create.
>>>>>>>
>>>>>>> I have 5 tasks, let's say T0, T1, T2, T3, T4, resulting from two
>>>>>>> spawn operations by T0.
>>>>>>>
>>>>>>> So I have two intra-communicators:
>>>>>>>
>>>>>>> intra0 contains : T0, T1, T2
>>>>>>> intra1 contains : T0, T3, T4
>>>>>>>
>>>>>>> My goal is to make a collective loop to build a single
>>>>>>> intra-communicator containing T0, T1, T2, T3, T4.
>>>>>>>
>>>>>>> I tried to do it using MPI_Intercomm_create and
>>>>>>> MPI_Intercomm_merge calls, but without success (I always get MPI
>>>>>>> internal errors).
>>>>>>>
>>>>>>> What I am doing :
>>>>>>>
>>>>>>> on T0 :
>>>>>>> *******
>>>>>>>
>>>>>>> MPI_Intercomm_create(intra0,0,intra1,0,1,&new_com)
>>>>>>>
>>>>>>> on T1 and T2 :
>>>>>>> **************
>>>>>>>
>>>>>>> MPI_Intercomm_create(intra0,0,MPI_COMM_WORLD,0,1,&new_com)
>>>>>>>
>>>>>>> on T3 and T4 :
>>>>>>> **************
>>>>>>>
>>>>>>> MPI_Intercomm_create(intra1,0,MPI_COMM_WORLD,0,1,&new_com)
>>>>>>>
>>>>>>>
>>>>>>> I'm certainly missing something. Could anybody help me to solve
>>>>>>> this problem ?
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Frédéric.
>>>>>>>
>>>>>>> PS : of course I did an extensive web search without finding
>>>>>>> anything useful on my problem.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>
>>>>> --
>>>>> Edgar Gabriel
>>>>> Assistant Professor
>>>>> Parallel Software Technologies Lab, http://pstl.cs.uh.edu
>>>>> Department of Computer Science, University of Houston
>>>>> Philip G. Hoffman Hall, Room 524, Houston, TX-77204, USA
>>>>> Tel: +1 (713) 743-3857, Fax: +1 (713) 743-3335
>>>>>
>> <spawn-example.c>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/