Subject: Re: [OMPI users] InfiniBand path migration not working
From: Jeremy (spritzydog_at_[hidden])
Date: 2012-03-21 09:31:34


Hi Pasha,

I just wanted to check if you had any further suggestions regarding
the APM issue based on the updated info in my previous email.

Thanks,

-Jeremy

On Mon, Mar 12, 2012 at 12:43 PM, Jeremy <spritzydog_at_[hidden]> wrote:
> Hi Pasha, Yevgeny,
>
>>> My educated guess is that for some reason there is no direct connection path
>>> between lid-2 and lid-4. To prove it we have to look at the OpenSM routing
>>> information.
>
>> If you don't get a response, or you get info for
>> a device different from what you would expect,
>> then the two ports are not part of the same
>> subnet, and APM is expected to fail.
>> Otherwise - it's probably a bug.
>
> I've tried your suggestions and the details are below.  I am now
> testing with a trivial MPI application that just does an
> MPI_Send/MPI_Recv and then sleeps for a while (attached).  There is
> much less output to weed through now!
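>
> In case it helps, here is a rough sketch of what the attached test
> program does (the attachment is the real thing; the message size,
> iteration count, and sleep length below are only placeholders):
>
>     /* Sketch of the attached test: rank 0 sends a small message to
>      * rank 1 each iteration, then both ranks sleep so a cable can be
>      * pulled while the job is idle. */
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, i, buf;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         for (i = 0; i < 10; i++) {
>             if (rank == 0) {
>                 buf = i;
>                 MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>             } else if (rank == 1) {
>                 MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                          MPI_STATUS_IGNORE);
>                 printf("Iteration %d: received %d\n", i, buf);
>             }
>             sleep(30);  /* long pause: time to unplug/replug a cable */
>         }
>
>         MPI_Finalize();
>         return 0;
>     }
>
> The sleeps are only there to keep the job alive long enough to pull
> the cable between sends.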
>
> When I unplug a cable from Port 1, the LID associated with Port 2 is
> still reachable with smpquery.  So it looks like there should be a
> valid path to migrate to on the same subnet.
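>
> As a side note, the same check could also be done locally on sulu with
> libibverbs instead of smpquery; the following is just an untested
> sketch for illustration, not something from my actual run:
>
>     /* Sketch: query both ports of the first HCA found (mlx4_0 in my
>      * case) and print their state and LID, e.g. to confirm Port 2 is
>      * still ACTIVE with LID 5 after Port 1's cable is pulled. */
>     #include <stdio.h>
>     #include <infiniband/verbs.h>
>
>     int main(void)
>     {
>         int num;
>         struct ibv_device **devs = ibv_get_device_list(&num);
>         if (!devs || num == 0) {
>             fprintf(stderr, "no IB devices found\n");
>             return 1;
>         }
>
>         struct ibv_context *ctx = ibv_open_device(devs[0]);
>         if (!ctx) {
>             fprintf(stderr, "failed to open device\n");
>             return 1;
>         }
>
>         for (int port = 1; port <= 2; port++) {
>             struct ibv_port_attr attr;
>             if (ibv_query_port(ctx, port, &attr) == 0)
>                 printf("port %d: state=%d lid=0x%x\n",
>                        port, attr.state, attr.lid);
>         }
>
>         ibv_close_device(ctx);
>         ibv_free_device_list(devs);
>         return 0;
>     }
>
> (It would need to be linked with -libverbs.)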
>
> I am using 2 hosts in this output:
> sulu:  This is the host where I unplug the cable from Port 1. The
> cable on Port 2 is connected all the time. LIDs 4 and 5.
> bones:  On this host I leave cables connected to both Ports all the
> time. LIDs 2 and 3.
>
> A) Before I start, sulu shows that both Ports are up and active using
> LIDs 4 and 5:
> sulu> ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe1
>        base lid:        0x4
>        sm lid:          0x6
>        state:           4: ACTIVE
>        phys state:      5: LinkUp
>        rate:            56 Gb/sec (4X FDR)
>        link_layer:      InfiniBand
>
> Infiniband device 'mlx4_0' port 2 status:
>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe2
>        base lid:        0x5
>        sm lid:          0x6
>        state:           4: ACTIVE
>        phys state:      5: LinkUp
>        rate:            56 Gb/sec (4X FDR)
>        link_layer:      InfiniBand
>
> B) The other host, bones, is able to get to LIDs 4 and 5 OK:
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
> # Node info: Lid 4
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Channel Adapter
> NumPorts:........................2
> SystemGuid:......................0x0002c90300336fe3
> Guid:............................0x0002c90300336fe0
> PortGuid:........................0x0002c90300336fe1
> PartCap:.........................128
> DevId:...........................0x1003
> Revision:........................0x00000000
> LocalPort:.......................1
> VendorId:........................0x0002c9
>
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
> # Node info: Lid 5
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Channel Adapter
> NumPorts:........................2
> SystemGuid:......................0x0002c90300336fe3
> Guid:............................0x0002c90300336fe0
> PortGuid:........................0x0002c90300336fe2
> PartCap:.........................128
> DevId:...........................0x1003
> Revision:........................0x00000000
> LocalPort:.......................2
> VendorId:........................0x0002c9
>
> C) I start the MPI program.  See attached file for output.
>
> D) During Iteration 3, I unplugged the cable on Port 1 of sulu.
> - I get the expected network error event message.
> - sulu shows that Port 1 is down and Port 2 is active as expected.
> - bones is still able to get to LID 5 on Port 2 of sulu as expected.
> - The MPI application hangs and then terminates instead of running via LID 5.
>
> sulu> ibstatus
> Infiniband device 'mlx4_0' port 1 status:
>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe1
>        base lid:        0x4
>        sm lid:          0x6
>        state:           1: DOWN
>        phys state:      2: Polling
>        rate:            40 Gb/sec (4X QDR)
>        link_layer:      InfiniBand
>
> Infiniband device 'mlx4_0' port 2 status:
>        default gid:     fe80:0000:0000:0000:0002:c903:0033:6fe2
>        base lid:        0x5
>        sm lid:          0x6
>        state:           4: ACTIVE
>        phys state:      5: LinkUp
>        rate:            56 Gb/sec (4X FDR)
>        link_layer:      InfiniBand
>
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 4
> ibwarn: [11192] mad_rpc: _do_madrpc failed; dport (Lid 4)
> smpquery: iberror: failed: operation NodeInfo: node info query failed
>
> bones> smpquery --Ca mlx4_0 --Port 1 NodeInfo 5
> # Node info: Lid 5
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Channel Adapter
> NumPorts:........................2
> SystemGuid:......................0x0002c90300336fe3
> Guid:............................0x0002c90300336fe0
> PortGuid:........................0x0002c90300336fe2
> PartCap:.........................128
> DevId:...........................0x1003
> Revision:........................0x00000000
> LocalPort:.......................2
> VendorId:........................0x0002c9
>
> Thanks,
>
> -Jeremy