Open MPI Development Mailing List Archives

From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-06-14 10:43:23


>
> The patch applies to ib_multifrag as is, without a conflict. But the
> branch doesn't compile with or without the patch, so I was not able to
> test it. Do you have uncommitted changes that might generate a
> conflict? Can you commit them so they can be resolved? If there is no
> conflict between your work and this patch, maybe it is a good idea to
> commit it to your branch and trunk for testing?
>

I have a whole pile of changes that need to be committed, and even
with these changes it still doesn't compile, as I am reworking names,
data structures, etc.
I will commit what I have now and will work on this a bit more over
the weekend.
- Galen

>>
>>>
>>> Thanks,
>>>
>>> Galen
>>>
>>>
>>> On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I encountered a problem with the openib on-demand connection code.
>>>> Basically, it works only by pure luck if you have more than one
>>>> endpoint for the same proc, and it sometimes breaks in mysterious
>>>> ways.
>>>>
>>>> The algorithm works like this: A wants to connect to B, so it creates
>>>> a QP and sends it to B. B receives the QP from A, looks for an
>>>> endpoint that is not yet associated with a remote endpoint, creates a
>>>> QP for it and sends the info back. Now A receives the QP and goes
>>>> through the same logic as B, i.e. it looks for an endpoint that is not
>>>> yet connected, BUT there is no guarantee that it will find the
>>>> endpoint that initiated the connection in the first place! If it
>>>> finds another one, it will create a QP for it and send it back to B,
>>>> and so on and so forth. In the end I sometimes get a peculiar mesh of
>>>> connections in which no QP has a connection back to it from the peer
>>>> process.
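
The failure mode described above can be reproduced with a small model. The following is a minimal sketch, not the openib BTL code: the `Endpoint` class, the "first unassociated endpoint" search order, and the rule that the exchange stops once a reply lands on an endpoint that already owns a QP are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Endpoint:
    name: str
    has_qp: bool = False
    remote: Optional[str] = None  # name of the remote endpoint we recorded


def first_unassociated(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the first endpoint with no remote recorded (assumed search order)."""
    for ep in endpoints:
        if ep.remote is None:
            return ep
    return None


def naive_wireup(initiator_idx: int, n: int = 2) -> Dict[str, Optional[str]]:
    """Model the original handshake between procs A and B with n endpoints each."""
    a = [Endpoint(f"a{i}") for i in range(n)]
    b = [Endpoint(f"b{i}") for i in range(n)]
    a[initiator_idx].has_qp = True       # A creates a QP on the initiating endpoint
    sender = a[initiator_idx].name       # ...and sends it to B
    receivers = (b, a)                   # messages alternate: B first, then A
    turn = 0
    while True:
        ep = first_unassociated(receivers[turn % 2])
        if ep is None:
            break                        # no free endpoint left on this side
        already_had_qp = ep.has_qp
        ep.has_qp = True
        ep.remote = sender
        if already_had_qp:
            break                        # reply reached an endpoint that initiated
        sender = ep.name                 # fresh QP: send its info back to the peer
        turn += 1
    return {ep.name: ep.remote for ep in a + b}


def mutual_pairs(links: Dict[str, Optional[str]]) -> List[Tuple[str, str]]:
    """Pairs (x, y) where x points at y and y points back at x."""
    return [(x, y) for x, y in links.items()
            if y is not None and links.get(y) == x and x < y]


if __name__ == "__main__":
    print(naive_wireup(0), mutual_pairs(naive_wireup(0)))
    print(naive_wireup(1), mutual_pairs(naive_wireup(1)))
```

In this model, a request started from the first endpoint of A happens to pair up correctly, while a request started from the second endpoint leaves every endpoint pointing at a peer endpoint that points somewhere else.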
>>>>
>>>> To overcome this problem, B needs to send back some info that allows
>>>> A to determine which endpoint initiated the connection request. The
>>>> lid:qp pair allows for this. But even then the problem remains if two
>>>> procs initiate a connection at the same time. To deal with
>>>> simultaneous connections, an asymmetric protocol has to be used: one
>>>> peer becomes the master and the other the slave. The slave always
>>>> initiates the connection to the master. The master chooses a local
>>>> endpoint to satisfy the incoming request and sends the info back to
>>>> the slave. If the master wants to initiate a connection, it sends a
>>>> message to the slave, and the slave initiates a connection back to the
>>>> master.
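
The asymmetric scheme can be sketched along the same lines. This is only an illustration under stated assumptions, not the contents of the patch: the `Proc` class, the QP numbering, and in particular the rule in `is_master` (lower rank is master) are hypothetical, since the message does not say how the roles are assigned.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

QpId = Tuple[int, int]  # (lid, qp number), the lid:qp pair from the message


@dataclass
class Endpoint:
    lid: int
    qp: Optional[int] = None
    remote: Optional[QpId] = None


class Proc:
    def __init__(self, rank: int, lids: List[int]) -> None:
        self.rank = rank
        self.endpoints = [Endpoint(lid) for lid in lids]
        self._next_qp = 100 * (rank + 1)  # made-up QP numbers for the model

    def _create_qp(self, ep: Endpoint) -> QpId:
        ep.qp = self._next_qp
        self._next_qp += 1
        return (ep.lid, ep.qp)

    def initiate(self, ep_index: int) -> QpId:
        """Slave side: create a QP on the chosen endpoint and send its lid:qp."""
        return self._create_qp(self.endpoints[ep_index])

    def accept(self, remote_id: QpId) -> Tuple[QpId, QpId]:
        """Master side: pick a free endpoint; the reply echoes the initiator's lid:qp."""
        ep = next(e for e in self.endpoints if e.remote is None)
        local_id = self._create_qp(ep)
        ep.remote = remote_id
        return local_id, remote_id

    def complete(self, master_id: QpId, echoed_id: QpId) -> None:
        """Slave side: find the initiating endpoint by the echoed lid:qp."""
        ep = next(e for e in self.endpoints if (e.lid, e.qp) == echoed_id)
        ep.remote = master_id


def is_master(local_rank: int, remote_rank: int) -> bool:
    return local_rank < remote_rank  # hypothetical role assignment


def connect(requester: Proc, peer: Proc, ep_index: int = 0) -> Tuple[QpId, QpId]:
    """Whoever asks, the slave is the one that initiates the QP exchange."""
    if is_master(requester.rank, peer.rank):
        master, slave = requester, peer
        # The master only asks; the slave picks one of its endpoints without a QP.
        ep_index = next(i for i, e in enumerate(slave.endpoints) if e.qp is None)
    else:
        master, slave = peer, requester
    slave_id = slave.initiate(ep_index)
    master_id, echoed = master.accept(slave_id)
    slave.complete(master_id, echoed)
    return slave_id, master_id
```

Because the reply carries the initiator's own lid:qp, the slave never has to guess which endpoint the answer belongs to, and because only the slave ever initiates, two simultaneous requests cannot cross.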
>>>>
>>>> The included patch implements the algorithm described above and works
>>>> for all scenarios in which the current code fails to create a
>>>> connection.
>>>>
>>>> --
>>>> Gleb.
>>>> <fix_openib_wireup.diff>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>> --
>> Gleb.
>
> --
> Gleb.