Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] InfiniBand path migration not working
From: Shamis, Pavel (shamisp_at_[hidden])
Date: 2012-03-21 10:53:02


Jeremy,

As far as I understand, the tool that Evgeny recommended showed that the remote port is reachable.
Based on the log that has been provided, I can't find the issue in ompi; everything seems to be kosher.
Unfortunately, I do not have a platform where I can try to reproduce the issue. I would ask Evgeny;
maybe Mellanox will be able to reproduce and debug the issue.

Pavel (Pasha) Shamis

---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Mar 21, 2012, at 9:31 AM, Jeremy wrote:
> Hi Pasha,
> 
> I just wanted to check if you had any further suggestions regarding
> the APM issue based on the updated info in my previous email.
> 
> Thanks,
> 
> -Jeremy
> 
> On Mon, Mar 12, 2012 at 12:43 PM, Jeremy <spritzydog_at_[hidden]> wrote:
>> Hi Pasha, Yevgeny,
>> 
>>>> My educated guess is that for some reason there is no direct connection path
>>>> between lid-2 and lid-4. To prove it we have to look at the OpenSM routing
>>>> information.
>> 
>>> If you don't get a response, or you get info for
>>> a device different from what you would expect,
>>> then the two ports are not part of the same
>>> subnet, and APM is expected to fail.
>>> Otherwise - it's probably a bug.
>> 
>> I've tried your suggestions and the details are below.  I am now
>> testing with a trivial MPI application that just does an
>> MPI_Send/MPI_Recv and then sleeps for a while (attached).  There is
>> much less output to weed through now!
>> 
>> When I unplug a cable from Port 1, the LID associated with Port 2 is
>> still reachable with smpquery.  So it looks like there should be a
>> valid path to migrate to on the same subnet.
>> 
>> I am using 2 hosts in this output:
>> sulu:  This is the host where I unplug the cable from Port 1. The
>> cable on Port 2 is connected all the time. LIDs 4 and 5.
>> bones:  On this host I leave cables connected to both Ports all the
>> time. LIDs 2 and 3.
>> 
>> A) Before I start, sulu shows that both Ports are up and active using
>> LIDs 4 and 5:
>> sulu> ibstatus
>> Infiniband device 'mlx4_0' port 1 status:
>>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe1
>>        base lid:        0x4
>>        sm lid:          0x6
>>        state:           4: ACTIVE
>>        phys state:      5: LinkUp
>>        rate:            56 Gb/sec (4X FDR)
>>        link_layer:      InfiniBand
>> 
>> Infiniband device 'mlx4_0' port 2 status:
>>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe2
>>        base lid:        0x5
>>        sm lid:          0x6
>>        state:           4: ACTIVE
>>        phys state:      5: LinkUp
>>        rate:            56 Gb/sec (4X FDR)
>>        link_layer:      InfiniBand
>> 
>> B) The other host, bones, is able to get to LIDs 4 and 5 OK:
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
>> # Node info: Lid 4
>> BaseVers:........................1
>> ClassVers:.......................1
>> NodeType:........................Channel Adapter
>> NumPorts:........................2
>> SystemGuid:......................0x0002c90300336fe3
>> Guid:............................0x0002c90300336fe0
>> PortGuid:........................0x0002c90300336fe1
>> PartCap:.........................128
>> DevId:...........................0x1003
>> Revision:........................0x00000000
>> LocalPort:.......................1
>> VendorId:........................0x0002c9
>> 
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
>> # Node info: Lid 5
>> BaseVers:........................1
>> ClassVers:.......................1
>> NodeType:........................Channel Adapter
>> NumPorts:........................2
>> SystemGuid:......................0x0002c90300336fe3
>> Guid:............................0x0002c90300336fe0
>> PortGuid:........................0x0002c90300336fe2
>> PartCap:.........................128
>> DevId:...........................0x1003
>> Revision:........................0x00000000
>> LocalPort:.......................2
>> VendorId:........................0x0002c9
>> 
>> C) I start the MPI program.  See attached file for output.
>> 
>> D) During Iteration 3, I unplugged the cable on Port 1 of sulu.
>> - I get the expected network error event message.
>> - sulu shows that Port 1 is down and Port 2 is active as expected.
>> - bones is still able to get to LID 5 on Port 2 of sulu as expected.
>> - The MPI application hangs and then terminates instead of running via LID 5.
>> 
>> sulu> ibstatus
>> Infiniband device 'mlx4_0' port 1 status:
>>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe1
>>        base lid:        0x4
>>        sm lid:          0x6
>>        state:           1: DOWN
>>        phys state:      2: Polling
>>        rate:            40 Gb/sec (4X QDR)
>>        link_layer:      InfiniBand
>> 
>> Infiniband device 'mlx4_0' port 2 status:
>>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe2
>>        base lid:        0x5
>>        sm lid:          0x6
>>        state:           4: ACTIVE
>>        phys state:      5: LinkUp
>>        rate:            56 Gb/sec (4X FDR)
>>        link_layer:      InfiniBand
>> 
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
>> ibwarn: [11192] mad_rpc: _do_madrpc failed; dport (Lid 4)
>> smpquery: iberror: failed: operation NodeInfo: node info query failed
>> 
>> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
>> # Node info: Lid 5
>> BaseVers:........................1
>> ClassVers:.......................1
>> NodeType:........................Channel Adapter
>> NumPorts:........................2
>> SystemGuid:......................0x0002c90300336fe3
>> Guid:............................0x0002c90300336fe0
>> PortGuid:........................0x0002c90300336fe2
>> PartCap:.........................128
>> DevId:...........................0x1003
>> Revision:........................0x00000000
>> LocalPort:.......................2
>> VendorId:........................0x0002c9
>> 
>> Thanks,
>> 
>> -Jeremy
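
The port checks in steps A and D above were done by reading ibstatus output by eye. As a rough sketch only, the same check could be scripted. It assumes the output has the same layout as the ibstatus text pasted in this thread (a "port N status:" header followed by "key: value" lines). The helper names parse_ibstatus and active_lids are made up for illustration and are not part of Open MPI or infiniband-diags.

```python
import re

# Header line as it appears in the ibstatus output quoted in this thread.
PORT_HEADER = re.compile(
    r"Infiniband device '(?P<device>[^']+)' port (?P<port>\d+) status:"
)
# Indented "key: value" lines, e.g. "base lid:  0x4" or "state:  4: ACTIVE".
FIELD = re.compile(r"^\s*(?P<key>[a-z_ ]+):\s+(?P<value>.+?)\s*$")
# Leading mail quote markers such as ">> ", so quoted output can be pasted as-is.
QUOTE = re.compile(r"^\s*(>+\s?)+")


def parse_ibstatus(text):
    """Return one dict per port found in ibstatus-style text (hypothetical helper)."""
    ports = []
    current = None
    for raw in text.splitlines():
        line = QUOTE.sub("", raw)
        header = PORT_HEADER.search(line)
        if header:
            current = {
                "device": header.group("device"),
                "port": int(header.group("port")),
            }
            ports.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group("key").strip()] = field.group("value")
        elif line.strip():
            # Any other non-blank line (a shell prompt, smpquery output, ...)
            # ends the current port block.
            current = None
    return ports


def active_lids(ports):
    """Base LIDs (as integers) of the ports whose state line ends in ACTIVE."""
    return [
        int(p["base lid"], 16)
        for p in ports
        if p.get("state", "").endswith("ACTIVE") and "base lid" in p
    ]
```

Fed the sulu output from step D, active_lids(parse_ibstatus(text)) would contain only LID 5, which is the LID the connection would be expected to migrate to. This only checks local port state; it says nothing about whether the subnet manager actually has a route between the two hosts.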