
Subject: [OMPI users] InfiniBand path migration not working
From: Jeremy (spritzydog_at_[hidden])
Date: 2012-02-22 13:44:10


Hi,

I am having a problem getting Alternative Path Migration (APM) to work
over the InfiniBand ports on my HCA.

Details of my configuration and the issue are below. Please let me
know if you can suggest any changes or corrections to my
configuration. I will be happy to try other experiments and tests or
provide additional details to debug this problem further.

I have reviewed the Open MPI FAQ and the archive of this mailing list,
but I was unable to resolve my problem. There was one thread on
multi-rail fail-over with IB, but it did not provide sufficient
information.

Thanks for your help,
Jeremy

Configuration:
Open MPI version 1.4.3, bundled with OFED.
I have also tested with Open MPI version 1.5.4, but the results were the same.

I have 2 machines; each machine has a dual-port Mellanox IB HCA
(MCX354A-FCBT, ConnectX-3 FDR).
I have cabled both ports of each HCA to the same IB switch (Mellanox SX6036).
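
(For reference, the link state of both ports can be confirmed with the
standard OFED ibstat tool, e.g. -- mlx4_0 is just an example device name:

ibstat mlx4_0

Both Port 1 and Port 2 should report "State: Active" and "Physical
state: LinkUp" before the cable is pulled.)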

What I expected to happen:
I am trying to migrate data transmission between 2 ports of the same HCA.
Start an MPI application. Unplug the fiber cable from Port 1 of an
HCA. Observe that the MPI application continues and data is sent
across Port 2 of the HCA.
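
For context, the "demo" program is basically a continuous ping-pong
between the two ranks. The sketch below is a simplified stand-in for it
(my own illustration, not the exact source), just to show the traffic
pattern being tested:

/* Simplified stand-in for the "demo" program: ranks 0 and 1 bounce a
 * 1 MiB buffer back and forth so there is continuous traffic on the
 * fabric while a cable is unplugged. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    static char buf[1 << 20];          /* 1 MiB payload */
    int rank;
    long i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, sizeof(buf));

    /* Long enough to unplug a cable partway through the run. */
    for (i = 0; i < 1000000; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}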

However, when I unplug the cable from Port 1 of the IB HCA, the MPI
application hangs and I get the following error messages:
Error 10: IBV_EVENT_PORT_ERR
Error 7: IBV_EVENT_PATH_MIG_ERR
Alternative path migration event reported
Trying to find additional path...
APM: already all ports were used port_num 2 apm_port 2

I've pasted the full verbose error message at the bottom of this email.

I started the MPI application using the following mpirun invocation:
mpirun -np 2 -machinefile machines -mca btl_openib_enable_apm_over_ports 1 demo
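
(To double-check that this build actually exposes the parameter, the
openib BTL parameters can be listed with ompi_info, e.g.:

ompi_info --param btl openib | grep apm

which should show btl_openib_enable_apm_over_ports among the
APM-related parameters.)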

What works:
I think that the low-level Mellanox IB hardware is working as
expected: the switch, the cables and both HCA ports move data fine.
If I don't use the btl_openib_enable_apm_over_ports option, then MPI
traffic is spread evenly across both Port 1 and Port 2 while the job
is running.
Also, I am able to fail over successfully using a bonded IPoIB device.
For example, if I use netperf to send TCP data over a bonded IPoIB
device, I get the expected behavior: when I unplug Port 1, netperf
keeps running and traffic goes over Port 2.
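
(That bonding test was roughly of the following form, where <bond0-ip>
stands for the IPoIB address of the remote bond0 interface:

netperf -H <bond0-ip> -l 300

The -l option just makes the TCP_STREAM run long enough to unplug a
cable mid-test.)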

Detailed Error Message:
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.

  Local host: bones
  MPI process PID: 23115
  Error number: 10 (IBV_EVENT_PORT_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI
will try to continue, but your job may end up failing.

  Local host: bones
  MPI process PID: 23115
  Error number: 7 (IBV_EVENT_PATH_MIG_ERR)

This error may indicate connectivity problems within the fabric;
please contact your system administrator.
--------------------------------------------------------------------------
[bones][[57528,1],0][btl_openib_async.c:327:btl_openib_async_deviceh]
Alternative path migration event reported
[bones][[57528,1],0][btl_openib_async.c:329:btl_openib_async_deviceh]
Trying to find additional path...
[bones][[57528,1],0][btl_openib_async.c:516:apm_update_port] APM:
already all ports were used port_num 2 apm_port 2