Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] RoCE (IBoE) & OpenMPI
From: Shamis, Pavel (shamisp_at_[hidden])
Date: 2011-02-23 15:54:06


Here is what OFA says:
http://www.openfabrics.org/archives/spring2010sonoma/Wednesday/Liran%20Liss%20RoCE%20in%20OFED/rocee_update_liss.ppt
-----
Slide 7: Connection manager

• Based on RDMACM
  – OS IP stack used to resolve remote IP to DMAC and bind to outgoing Ethernet interface
      • VLAN determined according to bound netdev
      • RoCEE device selected accordingly
  – Network parameters (MTU, SL, timeout) obtained locally according to kernel policy
  – Connection proceeds with CM as in IB
------

This means that you have to bind the device to a specific VLAN; RDMACM will then
obtain the SL, MTU, etc. automatically. So RDMACM is supposed to hide all of these "IB" details.
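
For reference, the priority that a lossless switch class matches on is carried in the 802.1Q VLAN tag. The sketch below (Python, purely illustrative, not Open MPI or OFED code) packs the 16-bit tag control field from a priority and a VLAN ID. It assumes the SL is used directly as the 3-bit priority; the real SL-to-priority mapping is decided by the driver and kernel policy, and the function names are made up for this example.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q-tagged frame


def vlan_tci(priority, vlan_id, dei=0):
    """Return the 16-bit Tag Control Information field of an 802.1Q tag.

    Layout (most significant bit first): 3-bit priority (PCP),
    1-bit drop-eligible indicator (DEI), 12-bit VLAN ID.
    Assumption: the caller passes the SL as `priority` unchanged.
    """
    if not 0 <= priority <= 7:
        raise ValueError("priority must fit in 3 bits (0-7)")
    if dei not in (0, 1):
        raise ValueError("dei must be 0 or 1")
    if not 0 <= vlan_id <= 0xFFF:
        raise ValueError("vlan_id must fit in 12 bits (0-4095)")
    return (priority << 13) | (dei << 12) | vlan_id


def vlan_tag_bytes(priority, vlan_id):
    """Return the 4-byte 802.1Q tag (TPID + TCI) in network byte order."""
    return struct.pack("!HH", TPID_8021Q, vlan_tci(priority, vlan_id))
```

With this layout, traffic on VLAN 100 at priority 3 carries the TCI value 0x6064, which is what the switch would classify on.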

As I remember, I updated the trunk to select the RDMACM connection manager by default for RoCE ports: https://svn.open-mpi.org/trac/ompi/changeset/22311

I'm not sure if the change made its way into any production version. I don't work on this part of the code anymore :-)
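
On versions where rdmacm is not picked by default for RoCE ports, it has to be requested explicitly. A minimal sketch (Python) of building such an mpirun command line follows. btl_openib_cpc_include and btl_openib_ipaddr_include are the openib BTL parameters discussed in this thread; the helper name, the subnet, and the application path are placeholders for illustration only.

```python
import shlex


def roce_mpirun_argv(app, nprocs, ipaddr_include=None):
    """Build an mpirun argument list that requests the rdmacm connection manager.

    app            -- list with the application and its arguments (hypothetical)
    nprocs         -- number of MPI processes to start
    ipaddr_include -- optional IP/netmask (e.g. "192.168.10.0/24") used to
                      pick the network; the value is site-specific
    """
    argv = ["mpirun", "-np", str(nprocs),
            "--mca", "btl_openib_cpc_include", "rdmacm"]
    if ipaddr_include:
        argv += ["--mca", "btl_openib_ipaddr_include", ipaddr_include]
    argv += list(app)
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    cmd = roce_mpirun_argv(["./my_app"], 4, "192.168.10.0/24")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The same two parameters could equally go into a site-wide MCA parameter file rather than on the command line.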

Regards,

Pavel (Pasha) Shamis

---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Feb 22, 2011, at 6:21 PM, Michael Shuey wrote:
> Could you re-enable the SL param (btl_openib_ib_service_level) for
> RoCE?  Jeff was kind enough to provide a patch to let me specify the
> gid_index, but that doesn't seem to be working.  To get RoCE to work
> correctly (at least, on Nexus switches) I'll need to specify both a
> gid_index and an IB service level.  I think. :-)
> 
> Also, while the rdmacm connection manager is required for RoCE, it's
> not selected by default (like it is for iWARP).  You still need to add
> that to a config file or command line, or you get a rather cryptic
> option (at least up through OpenMPI 1.5.1).
> 
> --
> Mike Shuey
> 
> 
> 
> On Mon, Feb 21, 2011 at 12:34 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>> Random thought: is there a check to ensure that the SL MCA param is not set in a RoCE environment?  If not, we should probably add a show_help warning if the SL MCA param is set when using RoCE (i.e., that its value will be ignored).
>> 
>> 
>> On Feb 19, 2011, at 12:22 AM, Shamis, Pavel wrote:
>> 
>>> As far as I remember, we don't allow the user to specify an SL for RoCE. RoCE is considered a kind of Ethernet device, and the RDMACM connection manager is used to set up the connections. This means that in order to select network X or Y, you can use an IP/netmask (btl_openib_ipaddr_include).
>>> 
>>> Pavel (Pasha) Shamis
>>> ---
>>> Application Performance Tools Group
>>> Computer Science and Math Division
>>> Oak Ridge National Laboratory
>>> 
>>> On Feb 18, 2011, at 4:14 PM, Michael Shuey wrote:
>>> 
>>>> Per-node GID & SL settings == bad.  Site-wide GID & SL settings == good.
>>>> 
>>>> If this could be an MCA param (like btl_openib_ib_service_level)
>>>> that'd be great - we already have a global config file of similar
>>>> params.  We'd definitely want the same N everywhere.
>>>> 
>>>> --
>>>> Mike Shuey
>>>> 
>>>> 
>>>> 
>>>> On Fri, Feb 18, 2011 at 3:44 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>>>> On Feb 18, 2011, at 1:39 PM, Michael Shuey wrote:
>>>>> 
>>>>>> RoCE HCAs keep a GID table, like normal HCAs.  Every time you bring up
>>>>>> a vlan interface, another entry gets automatically added to the table.
>>>>>> If I select one of these other GIDs, packets get a VLAN tag, and that
>>>>>> contains the necessary priority bits (well, assuming I selected the
>>>>>> right IB service level, which is mapped to the priority tag in the
>>>>>> VLAN header) for the traffic to match a lossless class of service on
>>>>>> the switch.
>>>>> 
>>>>> Ah -- I see it now (it's been a looong time since I've looked in Open MPI's verbs code!).  We query and simply take the 0th GID from a given IBV device port's GID table.
>>>>> 
>>>>>> For this to work, I really need for the IB client to select a
>>>>>> non-default GID.  A few test programs included in OFED will do this,
>>>>>> but I'm not sure OpenMPI will.  Any thoughts?
>>>>> 
>>>>> Yes, we can do this.  It's pretty easy to add an MCA parameter to select the Nth GID rather than always taking the 0th.
>>>>> 
>>>>> To make this simple, can you make it so that the value of N is the same across all nodes in your cluster?  Then you can set a site-wide MCA param for that value of N and be done with this issue.  If we have to have a per-node setting of N, it could get a little hairy (it's do-able, but... it's a heckuva lot easier if N is the same everywhere).
>>>>> 
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquyres_at_[hidden]
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>> 
>>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
>> 
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]