Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] btl_openib_cpc_include rdmacm questions
From: Ralph Castain (rhc.openmpi_at_[hidden])
Date: 2011-05-11 17:38:12



On May 11, 2011, at 2:05 PM, Brock Palen <brockp_at_[hidden]> wrote:

> On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
>
>> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>>
>>>> We had another user hit the bug that causes collectives (this time MPI_Bcast()) to hang on IB. It was fixed by setting:
>>>>
>>>> btl_openib_cpc_include rdmacm
>>>
>>> Could someone explain this? We also have problems with collective hangs
>>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>>> see any relevant issues filed. However, rdmacm isn't an available value
>>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>>> that I understand what these things are...).
>>
>> Sorry for the delay -- perhaps an IB vendor can reply here with more detail...
>>
>> We had a user-reported issue of some hangs that the IB vendors have been unable to replicate in their respective labs. We *suspect* that it may be an issue with the oob openib CPC, but that code is pretty old and pretty mature, so all of us would be at least somewhat surprised if that were the case. If anyone can reliably reproduce this error, please let us know and/or give us access to your machines -- we have not closed this issue, but are unable to move forward because the customers who reported this issue switched to rdmacm and moved on (i.e., we don't have access to their machines to test any more).
>
> An update: we gave all our ib0 interfaces IP addresses on a 172. network. That let rdmacm work, and we now get the latencies we would expect. That said, we are still getting hangs. I can very reliably reproduce them using IMB with a specific core count on a specific test case.
>
> Just an update. Has anyone else had luck fixing the lockup issues on openib BTL for collectives in some cases? Thanks!

I'll go back to my earlier comments. Users always claim that their code doesn't have the sync issue, but it has proved to help more often than not, and it costs nothing to try.

My $.0002
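
For anyone following along who wants to try the btl_openib_cpc_include setting quoted above, here is a minimal sketch in Python of the two usual ways an MCA parameter gets passed to a job: on the mpirun command line with --mca, or through an environment variable prefixed with OMPI_MCA_. The helper names, the program path, and the rank count are made up for illustration and are not from this thread. Whether rdmacm is an accepted value depends on how your Open MPI was built, as Dave's message shows.

```python
import os
import shlex


def mca_env(params, base_env=None):
    """Return a copy of the environment with OMPI_MCA_<name> set per parameter."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env["OMPI_MCA_" + name] = str(value)
    return env


def mpirun_command(program, nprocs, params):
    """Build an mpirun argv that passes each parameter as `--mca <name> <value>`."""
    cmd = ["mpirun", "-np", str(nprocs)]
    for name, value in params.items():
        cmd += ["--mca", name, str(value)]
    cmd.append(program)
    return cmd


if __name__ == "__main__":
    params = {"btl_openib_cpc_include": "rdmacm"}
    # "./IMB-MPI1" and 16 ranks are placeholders, not the failing case reported here.
    print(shlex.join(mpirun_command("./IMB-MPI1", 16, params)))
```

The list returned by mpirun_command can be handed to subprocess.run as is. To use the environment form instead, pass env=mca_env(params) and leave the --mca arguments out.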

>
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp_at_[hidden]
> (734)936-1985
>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users