Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] InfiniBand path migration not working
From: Shamis, Pavel (shamisp_at_[hidden])
Date: 2012-02-22 22:43:11


Jeremy,
I implemented the APM support for the openib BTL a long time ago. I do not remember all the details of the implementation, but I remember that it is used to support LMC bits and multiple IB ports. Unfortunately I'm super busy this week. I will try to look at it early next week.
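
For readers unfamiliar with LMC (LID Mask Control): with an LMC of n, the subnet manager assigns a port 2^n consecutive LIDs starting at a base LID whose low n bits are zero, so one physical port can be addressed over several paths. The sketch below only illustrates that arithmetic. The function name `lids_for_port` and the example values are hypothetical and are not taken from the Open MPI source or from Jeremy's fabric.

```python
def lids_for_port(base_lid, lmc):
    """Return the LIDs a port answers to for a given base LID and LMC.

    Illustrative helper only; not part of Open MPI or libibverbs.
    """
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit field (0..7)")
    count = 1 << lmc  # 2**lmc LIDs per port
    if base_lid & (count - 1):
        raise ValueError("base LID must have its low LMC bits clear")
    return list(range(base_lid, base_lid + count))


if __name__ == "__main__":
    # Hypothetical values: base LID 8 with LMC 2 gives four LIDs.
    print(lids_for_port(8, 2))
```

With LMC set to 0 a port has a single LID, so on such a fabric an alternate path has to come from a different port rather than from a different LID on the same port.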

Pavel (Pasha) Shamis

---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
On Feb 22, 2012, at 1:44 PM, Jeremy wrote:
> Hi,
> 
> I am having a problem getting Alternative Path Migration (APM) to work
> over the InfiniBand ports on my HCA.
> 
> Details on my configuration and the issue I have are below.  Please
> let me know if you can provide any suggestions or corrections to my
> configuration.  I will be happy to try other experiments and tests or
> provide additional details to debug this problem further.
> 
> I have reviewed the Open MPI FAQ and the archive of this mailing list
> but I was unable to resolve my problem.  There was one thread on
> multi-rail fail-over with IB but it did not provide sufficient
> information.
> 
> Thanks for your help,
> Jeremy
> 
> Configuration:
> MPI version 1.4.3 Bundled with OFED.
> I have also tested with MPI version 1.5.4 but the results were the same.
> 
> I have 2 machines, each machine has a dual port Mellanox IB HCA
> Mellanox MCX354A-FCBT (ConnectX-3 FDR).
> I have cabled both ports of each HCA to the same IB Switch (Mellanox SX6036).
> 
> What I expected to happen:
> I am trying to migrate data transmission between 2 ports of the same HCA.
> Start an MPI application.  Unplug the fiber cable from Port 1 of an
> HCA.  Observe that the MPI application continues and data is sent
> across Port 2 of the HCA.
> 
> However, when I unplug the cable from Port 1 of the IB HCA, the MPI
> application hangs and I get the following error messages:
> Error 10: IBV_EVENT_PORT_ERR
> Error 7: IBV_EVENT_PATH_MIG_ERR
> Alternative path migration event reported
> Trying to find additional path...
> APM: already all ports were used port_num 2 apm_port 2
> 
> I've pasted the full verbose error message at the bottom of this email.
> 
> I started the MPI application using the following mpirun invocation:
> mpirun -np 2 -machinefile machines -mca btl_openib_enable_apm_over_ports 1 demo
> 
> What works:
> I think that the low level Mellanox IB hardware is working as
> expected.  The switch, cables and both HCA ports move data OK.
> If I don't use the btl_openib_enable_apm_over_ports option then MPI
> traffic is evenly spread across both Port 1 and Port 2 while it is
> running.
> Also, I am able to successfully do fail-over using a bonded device
> with IP. For example, if I use netperf to send TCP data over a bonded
> IPoIB device I get the expected behavior.  When I unplug Port 1,
> netperf keeps running and traffic goes over Port 2.
> 
> Detailed Error Message:
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event.  Open MPI
> will try to continue, but your job may end up failing.
> 
>  Local host:        bones
>  MPI process PID:   23115
>  Error number:      10 (IBV_EVENT_PORT_ERR)
> 
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event.  Open MPI
> will try to continue, but your job may end up failing.
> 
>  Local host:        bones
>  MPI process PID:   23115
>  Error number:      7 (IBV_EVENT_PATH_MIG_ERR)
> 
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> [bones][[57528,1],0][btl_openib_async.c:327:btl_openib_async_deviceh]
> Alternative path migration event reported
> [bones][[57528,1],0][btl_openib_async.c:329:btl_openib_async_deviceh]
> Trying to find additional path...
> [bones][[57528,1],0][btl_openib_async.c:516:apm_update_port] APM:
> already all ports were used port_num 2 apm_port 2
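
When comparing runs (for example 1.4.3 against 1.5.4), it can help to reduce the verbose output above to the sequence of asynchronous events and the final APM port values. The following is a minimal sketch, assuming the log format shown in this message. The event numbers are a subset of `enum ibv_event_type` from libibverbs; the function name `summarize` and the regular expressions are illustrative assumptions, not part of Open MPI.

```python
import re

# Subset of enum ibv_event_type from libibverbs (<infiniband/verbs.h>).
IBV_EVENT_NAMES = {
    6: "IBV_EVENT_PATH_MIG",
    7: "IBV_EVENT_PATH_MIG_ERR",
    9: "IBV_EVENT_PORT_ACTIVE",
    10: "IBV_EVENT_PORT_ERR",
}

# Patterns matched against the log text quoted in this thread.
ERROR_LINE = re.compile(r"Error number:\s+(\d+)\s+\((\w+)\)")
APM_LINE = re.compile(r"port_num\s+(\d+)\s+apm_port\s+(\d+)")


def summarize(log_text):
    """Return ([(event_number, event_name), ...], (port_num, apm_port) or None)."""
    events = [(int(num), name) for num, name in ERROR_LINE.findall(log_text)]
    match = APM_LINE.search(log_text)
    apm = (int(match.group(1)), int(match.group(2))) if match else None
    return events, apm


if __name__ == "__main__":
    sample = (
        " Error number:      10 (IBV_EVENT_PORT_ERR)\n"
        " Error number:      7 (IBV_EVENT_PATH_MIG_ERR)\n"
        "already all ports were used port_num 2 apm_port 2\n"
    )
    events, apm = summarize(sample)
    for number, name in events:
        print(number, name)
    print("port_num/apm_port:", apm)
```

In the log above, the port error (event 10) is followed by a path migration error (event 7), and the last line reports `port_num` and `apm_port` both equal to 2.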