Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] btl_openib_cpc_include rdmacm questions
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-05-09 09:29:06


Sorry for the delay on this -- it looks like the problem is indicated by messages like this (from your first message):

[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port

RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID where you want to use it.
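For example, one quick check (just a sketch -- "ib0" is an assumed interface name here; substitute whatever your IPoIB interface is actually called, and use whichever of the two commands your distribution provides) is to look for an IP address on the IPoIB interface of each node:

    ifconfig ib0
    ip addr show ib0

If no inet address shows up on that interface, the rdmacm CPC will exclude the port, which is what the "IP address not found on port" message above is reporting.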

On May 5, 2011, at 1:15 PM, Brock Palen wrote:

> Yeah, we have run into more issues, with rdmacm not being available on all of our hosts. So it would be nice to know what we can do to test whether a host supports rdmacm.
>
> Example:
>
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port. As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
> Local host: nyx5067.engin.umich.edu
> Local device: mlx4_0
> Local port: 1
> CPCs attempted: rdmacm
> --------------------------------------------------------------------------
>
> This is one of our QDR hosts on which rdmacm generally works, and which this code (CRASH) requires in order to avoid a collective hang in MPI_Allreduce().
>
> I look on this hosts and I find:
> [root_at_nyx5067 ~]# rpm -qa | grep rdma
> librdmacm-1.0.11-1
> librdmacm-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-utils-1.0.11-1
>
> So all the libraries are installed (I think). Is there a way to verify this, or to have Open MPI be more verbose about what caused rdmacm to fail as an alternative to oob?
>
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp_at_[hidden]
> (734)936-1985
>
>
>
> On May 3, 2011, at 9:42 AM, Dave Love wrote:
>
>> Brock Palen <brockp_at_[hidden]> writes:
>>
>>> We had another user hit the bug that causes collectives (this time MPI_Bcast()) to hang on IB, which was fixed by setting:
>>>
>>> btl_openib_cpc_include rdmacm
>>
>> Could someone explain this? We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed. However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
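On the questions above about verifying rdmacm support and getting more detail: a rough sketch (untested; the exact parameter descriptions and output vary by version and build, and "-np 2 ./your_app" is just a placeholder) is to ask ompi_info which connection methods your openib BTL knows about, and to rerun with BTL verbosity turned up:

    ompi_info --param btl openib | grep cpc
    mpirun --mca btl_openib_cpc_include rdmacm \
           --mca btl_base_verbose 100 -np 2 ./your_app

If rdmacm does not appear in the btl_openib_cpc_include description, that Open MPI installation was most likely built without librdmacm support, which would explain why rdmacm is not an accepted value for the parameter.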

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/