
Open MPI Development Mailing List Archives


From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-09-18 11:22:42


On Tue, Sep 18, 2007 at 10:57:38AM -0400, George Bosilca wrote:
> More information about this can be found in trac ticket #1127
> (https://svn.open-mpi.org/trac/ompi/ticket/1127).
>
OK. So the code I cited is only a temporary solution. Thanks.

> george.
>
> On Sep 18, 2007, at 10:20 AM, Gleb Natapov wrote:
>
>> On Tue, Sep 18, 2007 at 09:44:42AM -0400, George Bosilca wrote:
>>> The setup of a communicator includes, as its last stage, a collective
>>> communication. As a result, some of the nodes can exit the collective
>>> before the others and can therefore start sending messages using this
>>> communicator [while some of the other nodes are still waiting for the
>>> collective to complete]. This leads to a situation where a node
>>> receives a message for a communicator that it is still building.
>>>
>>> There is a bug filed in trac about this. In FT-MPI we temporarily put
>>> these messages in an internal queue, and deliver them to the right
>>> communicator only once this communicator is completely created.
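
The queueing approach described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not FT-MPI or
Open MPI code: every name here (on_fragment, comm_activate, deliver,
comm_ready, MAX_CID) is hypothetical, a fragment is reduced to a cid plus
an integer tag, and delivery just records the tag in an array.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch: none of these names exist in FT-MPI or Open MPI. */
#define MAX_CID       64
#define MAX_DELIVERED 128

typedef struct pending_msg {
    int cid;                     /* communicator id the fragment targets */
    int tag;                     /* stand-in for the real fragment data  */
    struct pending_msg *next;
} pending_msg_t;

static int comm_ready[MAX_CID];  /* 1 once the communicator is fully created */
static pending_msg_t *pending_head = NULL;
static pending_msg_t *pending_tail = NULL;

static int delivered[MAX_DELIVERED];  /* record of delivered tags, in order */
static int delivered_count = 0;

/* Stand-in for matching the fragment against the communicator's queues. */
void deliver(int cid, int tag)
{
    (void)cid;
    if (delivered_count < MAX_DELIVERED) {
        delivered[delivered_count++] = tag;
    }
}

/* Called when a fragment arrives. If the communicator is not yet complete
 * on this process, park the fragment instead of dropping it.
 * Returns 0 on success, -1 on an invalid cid or allocation failure. */
int on_fragment(int cid, int tag)
{
    pending_msg_t *m;

    if (cid < 0 || cid >= MAX_CID) {
        return -1;
    }
    if (comm_ready[cid]) {
        deliver(cid, tag);
        return 0;
    }
    m = malloc(sizeof *m);
    if (m == NULL) {
        return -1;
    }
    m->cid = cid;
    m->tag = tag;
    m->next = NULL;
    if (pending_tail != NULL) {
        pending_tail->next = m;
    } else {
        pending_head = m;
    }
    pending_tail = m;
    return 0;
}

/* Called once communicator creation has completed locally: mark the cid
 * ready and deliver its parked fragments in arrival order. */
void comm_activate(int cid)
{
    pending_msg_t **link = &pending_head;
    pending_msg_t *last = NULL;

    assert(cid >= 0 && cid < MAX_CID);
    comm_ready[cid] = 1;
    while (*link != NULL) {
        pending_msg_t *m = *link;
        if (m->cid == cid) {
            *link = m->next;         /* unlink, then deliver and free */
            deliver(m->cid, m->tag);
            free(m);
        } else {
            last = m;                /* fragment for another cid stays queued */
            link = &m->next;
        }
    }
    pending_tail = last;
}
```

In this sketch, fragments for an unfinished communicator stay parked
until comm_activate() is called for that cid. Fragments for other
communicators that are still pending remain in the queue untouched.
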
>> In the ompi_comm_nextcid() function there is this code for the
>> thread_multiple case:
>>
>> /* for synchronization purposes, avoids receiving fragments for
>> a communicator id, which might not yet been known. For single-threaded
>> scenarios, this call is in ompi_comm_activate, for multi-threaded
>> scenarios, it has to be already here ( before releasing another
>> thread into the cid-allocation loop ) */
>> (allredfnct)(&response, &glresponse, 1, MPI_MIN, comm, bridgecomm,
>> local_leader, remote_leader, send_first );
>>
>> This collective is executed on the old communicator after the new cid
>> has been set up. Is this not enough to solve the problem? Some ranks may
>> leave this collective call earlier than others, but none can leave it
>> before all ranks have entered it, and at that stage the new communicator
>> already exists in all of them. Am I missing something?
>>
>>
>>>
>>> george.
>>>
>>> On Sep 18, 2007, at 9:06 AM, Gleb Natapov wrote:
>>>
>>>> George,
>>>>
>>>> In the comment you are saying that "a message for a not yet existing
>>>> communicator can happen". Can you explain in what situation it can
>>>> happen?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Gleb.
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>>
>>
>> --
>> Gleb.
>


--
			Gleb.