Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with MPI_Intercomm_create
From: Frédéric Feyel (mcffeyel_at_[hidden])
Date: 2011-06-12 03:44:05


Dear all, thank you very much for the time spent looking at my problem.

After reading your contributions, it is still not clear to me whether there
is a bug in Open MPI or not.

So I created a small, self-contained test program to analyse the behavior,
and the problem is still there.

I was wondering whether the local and remote leaders of the two groups could
be the same process. Unfortunately, I get an error in both cases (local and
remote leaders identical or not).

What do you think of my small test program?
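
In case the attachment gets stripped by the list, here is a minimal
sketch of the pattern I am testing. It is not the attached file itself:
to keep it self-contained in a single executable it builds the two
disjoint groups with MPI_Comm_split instead of MPI_Comm_spawn, and uses
MPI_COMM_WORLD as the bridge; the group sizes and leader ranks are only
an example.

    /* Minimal MPI_Intercomm_create / MPI_Intercomm_merge sketch.
     * Run with at least 3 processes, e.g.: mpirun -np 5 ./a.out */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Comm local, inter, merged;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Two disjoint groups: world ranks 0-1, and everybody else. */
        int color = (rank < 2) ? 0 : 1;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local);

        /* Each group's local leader is rank 0 of its local
         * communicator.  The remote leader is named by its rank in the
         * bridge communicator (here MPI_COMM_WORLD): world rank 2 as
         * seen from group 0, world rank 0 as seen from group 1. */
        int remote_leader = (color == 0) ? 2 : 0;
        MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, remote_leader,
                             /* tag */ 1, &inter);

        /* Merge the inter-communicator back into a single
         * intra-communicator over all processes; group 1 is ordered
         * after group 0. */
        MPI_Intercomm_merge(inter, /* high */ color, &merged);

        int mrank, msize;
        MPI_Comm_rank(merged, &mrank);
        MPI_Comm_size(merged, &msize);
        printf("world rank %d -> merged rank %d of %d\n",
               rank, mrank, msize);

        MPI_Comm_free(&merged);
        MPI_Comm_free(&inter);
        MPI_Comm_free(&local);
        MPI_Finalize();
        return 0;
    }

I compile and run it with something like "mpicc intercomm_test.c -o
intercomm_test" and "mpirun -np 5 ./intercomm_test" (the file name is
just an example).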

Best regards,

Frédéric.

On Tue, 07 Jun 2011 10:31:51 -0500, Edgar Gabriel <gabriel_at_[hidden]>
wrote:
> On 6/7/2011 10:23 AM, George Bosilca wrote:
>>
>> On Jun 7, 2011, at 11:00, Edgar Gabriel wrote:
>>
>>> George,
>>>
>>> I did not look over all the details of your test, but it looks to
>>> me like you are violating one of the requirements of
>>> intercomm_create, namely the requirement that the two groups have
>>> to be disjoint. In your case the parent process(es) are part of
>>> both local intra-communicators, aren't they?
>>
>> The two groups of the two local communicators are disjoint. One
>> contains A,B while the other contains only C. The bridge communicator
>> contains A,C.
>>
>> I'm confident my example is supposed to work. At least for Open MPI
>> the error is under the hood, as the resulting inter-communicator is
>> valid but contains NULL endpoints for the remote process.
>
> I'll come back to that later; I am not yet convinced that your code is
> correct :-) Your local groups might be disjoint, but I am worried about
> the ranks of the remote leader in your example. They cannot be 0 from
> both groups' perspectives.
>
>>
>> Regarding the fact that the two leaders should be separate processes,
>> you will not find any wording about this in the current version of
>> the standard. In MPI 1.1 there were two contradictory sentences about
>> this: one stated that the two groups can be disjoint, while the other
>> claimed that the two leaders can be the same process. After
>> discussion, the agreement was that the two groups have to be
>> disjoint, and the standard has been amended to match that agreement.
>
>
> I realized that this is a non-issue. If the two local groups are
> disjoint, there is no way that the two local leaders are the same
> process.
>
> Thanks
> Edgar
>
>>
>> george.
>>
>>
>>>
>>> I just have MPI-1.1 at hand right now, but here is what it says:
>>> ----
>>>
>>> Overlap of local and remote groups that are bound into an
>>> inter-communicator is prohibited. If there is overlap, then the
>>> program is erroneous and is likely to deadlock.
>>>
>>> ----
>>>
>>> So the bottom line is that the two local intra-communicators that
>>> are being used have to be disjoint, and the bridgecomm needs to be
>>> a communicator in which at least one process from each of the two
>>> disjoint groups is able to talk to the other. Interestingly, I did
>>> not find a sentence saying whether that is allowed to be the same
>>> process, or whether the two local leaders need to be separate
>>> processes...
>>>
>>>
>>> Thanks Edgar
>>>
>>>
>>> On 6/7/2011 12:57 AM, George Bosilca wrote:
>>>> Frederic,
>>>>
>>>> Attached you will find an example that is supposed to work. The
>>>> main difference from your code is on T3 and T4, where you have
>>>> swapped the local and remote communicators. As depicted in the
>>>> picture attached below, during the third step you create the
>>>> intercomm between ab and c (no overlap) using ac as a bridge
>>>> communicator (here the two roots, a and c, can exchange messages).
>>>>
>>>> Based on the MPI 2.2 standard, especially the paragraph quoted in
>>>> the PS, the attached code should work. Unfortunately, I couldn't
>>>> run it successfully with either the Open MPI trunk or MPICH2
>>>> 1.4rc1.
>>>>
>>>> george.
>>>>
>>>> PS: Here is what the MPI standard states about
>>>> MPI_Intercomm_create:
>>>>> The function MPI_INTERCOMM_CREATE can be used to create an
>>>>> inter-communicator from two existing intra-communicators, in
>>>>> the following situation: At least one selected member from each
>>>>> group (the “group leader”) has the ability to communicate with
>>>>> the selected member from the other group; that is, a “peer”
>>>>> communicator exists to which both leaders belong, and each
>>>>> leader knows the rank of the other leader in this peer
>>>>> communicator. Furthermore, members of each group know the rank
>>>>> of their leader.
>>>>
>>>> On Jun 1, 2011, at 05:00, Frédéric Feyel wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a problem using MPI_Intercomm_create.
>>>>>
>>>>> I have 5 tasks, let's say T0, T1, T2, T3 and T4, resulting from
>>>>> two spawn operations performed by T0.
>>>>>
>>>>> So I have two intra-communicators:
>>>>>
>>>>> intra0 contains T0, T1, T2
>>>>> intra1 contains T0, T3, T4
>>>>>
>>>>> My goal is to build, in a collective loop, a single
>>>>> intra-communicator containing T0, T1, T2, T3, T4.
>>>>>
>>>>> I tried to do it using MPI_Intercomm_create and
>>>>> MPI_Intercomm_merge calls, but without success (I always get MPI
>>>>> internal errors).
>>>>>
>>>>> What I am doing:
>>>>>
>>>>> on T0:
>>>>> ******
>>>>>
>>>>> MPI_Intercomm_create(intra0,0,intra1,0,1,&new_com)
>>>>>
>>>>> on T1 and T2:
>>>>> *************
>>>>>
>>>>> MPI_Intercomm_create(intra0,0,MPI_COMM_WORLD,0,1,&new_com)
>>>>>
>>>>> on T3 and T4:
>>>>> *************
>>>>>
>>>>> MPI_Intercomm_create(intra1,0,MPI_COMM_WORLD,0,1,&new_com)
>>>>>
>>>>>
>>>>> I'm certainly missing something. Could anybody help me solve
>>>>> this problem?
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Frédéric.
>>>>>
>>>>> PS: of course I did an extensive web search without finding
>>>>> anything useful on my problem.
>>>>>
>>>
>>> --
>>> Edgar Gabriel
>>> Assistant Professor
>>> Parallel Software Technologies Lab, http://pstl.cs.uh.edu
>>> Department of Computer Science, University of Houston
>>> Philip G. Hoffman Hall, Room 524, Houston, TX 77204, USA
>>> Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335