Open MPI User's Mailing List Archives

Subject: [OMPI users] unknown BTL transport in openmpi 1.5.4 and 1.6.2
From: Wesley Emeneker (Wesley.Emeneker_at_[hidden])
Date: 2013-01-31 09:31:30


We have recently run into a problem using Open MPI 1.5.3, 1.5.4, and
1.6.2 across compute nodes with two different generations of InfiniBand
adapters (DDR and QDR).
The error is very similar to one posted to this list in 2011:
http://www.open-mpi.org/community/lists/users/2011/06/16773.php
That thread was never resolved on the mailing list.

Here is the error, shown with 1.5.4:
#################################################################
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.5.4/gcc-4.4.5/bin/mpirun

iwtf-k43-28$ cat machinefile
iwtf-k43-28
iwm-k43-30

iwtf-k43-28$ mpirun -np 2 -hostfile machinefile ./a.out 0
--------------------------------------------------------------------------
Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.

  Local host: iwm-k43-30.pace.gatech.edu
  Local adapter: mthca0 (vendor 0x2c9, part ID 25204)
  Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

  Remote host: iwtf-k43-28
  Remote Adapter: (vendor 0x2c9, part ID 26428)
  Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
------------------------------------------------------------------------------------------
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
[iwtf-k43-28.pace.gatech.edu:12695] 1 more process has sent help message
help-mpi-btl-openib.txt / conflicting transport types
[iwtf-k43-28.pace.gatech.edu:12695] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
----------------------------------------------------------

iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
--------------------------------------------------------------------------
Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.

  Local host: iwm-k43-30.pace.gatech.edu
  Local adapter: mthca0 (vendor 0x2c9, part ID 25204)
  Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

  Remote host: iwtf-k43-28
  Remote Adapter: (vendor 0x2c9, part ID 26428)
  Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[34066,1],1]) is on host: iwm-k43-30.pace.gatech.edu
  Process 2 ([[34066,1],0]) is on host: iwtf-k43-28
  BTLs attempted: self openib

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[iwm-k43-30.pace.gatech.edu:9131] *** An error occurred in MPI_Init
[iwm-k43-30.pace.gatech.edu:9131] *** on a NULL communicator
[iwm-k43-30.pace.gatech.edu:9131] *** Unknown error
[iwm-k43-30.pace.gatech.edu:9131] *** MPI_ERRORS_ARE_FATAL: your MPI job
will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason: Before MPI_INIT completed
  Local host: iwm-k43-30.pace.gatech.edu
  PID: 9131
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 9131 on
node iwm-k43-30 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message
help-mpi-btl-openib.txt / conflicting transport types
[iwtf-k43-28.pace.gatech.edu:13279] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message
help-mca-bml-r2.txt / unreachable proc
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message
help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message
help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed

#################################################################
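The help text above names two MCA parameters for further diagnosis:
btl_base_verbose (set to 100) and orte_base_help_aggregate (set to 0).
A minimal sketch of how the failing run could be repeated with both set.
The function name verbose_rerun is hypothetical; the hostfile and BTL list
are the ones used above, and no output from such a run is shown here.

```shell
# Hypothetical helper: repeat the failing two-host run with the MCA
# parameters suggested by the Open MPI help messages.
verbose_rerun() {
    # btl_base_verbose 100: report which BTL plugins were considered
    #                       and/or discarded.
    # orte_base_help_aggregate 0: print every help / error message
    #                             instead of "1 more process has sent...".
    mpirun -np 2 -hostfile machinefile \
        --mca btl openib,self \
        --mca btl_base_verbose 100 \
        --mca orte_base_help_aggregate 0 \
        "$@"
}

# Usage (not executed here): verbose_rerun ./a.out 0
```

Passing the program and its arguments through "$@" keeps the helper usable
for both the failing mixed-host case and the working single-host case.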

Open MPI 1.4.3 works as expected on the same two hosts:
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.4.3/gcc-4.4.5/bin/mpirun

iwtf-k43-28$ mpicc testmpi.c

iwtf-k43-28$ mpirun -np 2 -hostfile machinefile ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2

iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
######################################################################

######################################################################
1.5.4 runs fine on iwm-k43-30 by itself:
iwtf-k43-28$ cat machinefile
iwm-k43-30
iwm-k43-30
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.5.4/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ mpicc testmpi.c
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 0 of 2
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2

It fails only when a single job spans both hosts, i.e. the mthca0 (DDR)
node and the mlx4_0 (QDR) node together.

######################################################################

Relevant system information:
 - Same error on RHEL6.2 and RHEL6.3.

iwtf-k43-28$ uname -a
Linux iwtf-k43-28.pace.gatech.edu 2.6.32-220.23.1.el6.x86_64 #1 SMP Tue Jun 12 11:20:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
iwm-k43-30$ uname -a
Linux iwm-k43-30.pace.gatech.edu 2.6.32-220.23.1.el6.x86_64 #1 SMP Tue Jun 12 11:20:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa | grep -i verb
libibverbs-debuginfo-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64
libipathverbs-1.2-1.x86_64
libipathverbs-debuginfo-1.2-1.x86_64
libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64
libipathverbs-devel-1.2-1.x86_64

# rpm -qa | grep libmthca
libmthca-1.0.5-0.1.gbe5eef3.x86_64
libmthca-debuginfo-1.0.5-0.1.gbe5eef3.x86_64
libmthca-devel-static-1.0.5-0.1.gbe5eef3.x86_64

# rpm -qa | grep libmlx
libmlx4-devel-1.0.1-1.20.g6771d22.x86_64
libmlx4-debuginfo-1.0.1-1.20.g6771d22.x86_64
libmlx4-1.0.1-1.20.g6771d22.x86_64

1.4.3 "configure" flags: "--with-tm=/opt/torque/2.4.3
--with-io-romio-flags=\"--with-file-system=nfs+ufs+panfs\" --enable-static"
1.5.4 "configure" flags: "--with-tm=/opt/torque/2.4.3
--with-io-romio-flags=\"--with-file-system=nfs+ufs+panfs\"
--with-hwloc=/usr/local/packages/hwloc/1.2/ --enable-static"
1.6.2 "configure" flags: "--with-tm=/opt/torque/2.4.3
--with-io-romio-flags=\"--with-file-system=nfs+ufs+panfs\" --enable-static
--with-knem"

iwm-k43-30# ibv_devinfo -v
hca_id: mthca0
    transport: InfiniBand (0)
    fw_ver: 1.2.0
    node_guid: 0002:c902:0029:8434
    sys_image_guid: 0002:c902:0029:8437
    vendor_id: 0x02c9
    vendor_part_id: 25204
    hw_ver: 0xA0
    board_id: MT_03B0150002
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffff000
    max_qp: 64512
    max_qp_wr: 16384
    device_cap_flags: 0x00001c76
    max_sge: 27
    max_sge_rd: 0
    max_cq: 65408
    max_cqe: 131071
    max_mr: 131056
    max_pd: 32764
    max_qp_rd_atom: 4
    max_ee_rd_atom: 0
    max_res_rd_atom: 258048
    max_qp_init_rd_atom: 128
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 0
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 0
    max_mcast_grp: 8192
    max_mcast_qp_attach: 56
    max_total_mcast_qp_attach: 458752
    max_ah: 0
    max_fmr: 0
    max_srq: 960
    max_srq_wr: 16384
    max_srq_sge: 27
    max_pkeys: 64
    local_ca_ack_delay: 15
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 465
            port_lid: 54
            port_lmc: 0x00
            link_layer: IB
            max_msg_sz: 0x80000000
            port_cap_flags: 0x02510a68
            max_vl_num: 4 (3)
            bad_pkey_cntr: 0x0
            qkey_viol_cntr: 0x0
            sm_sl: 0
            pkey_tbl_len: 64
            gid_tbl_len: 32
            subnet_timeout: 17
            init_type_reply: 0
            active_width: 4X (2)
            active_speed: 5.0 Gbps (2)
            phys_state: LINK_UP (5)
            GID[ 0]: fe80:0000:0000:0000:0002:c902:0029:8435
##################################################################

##################################################################
iwtf-k43-28# ibv_devinfo -v
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.9.1000
    node_guid: 0002:c903:004b:2170
    sys_image_guid: 0002:c903:004b:2173
    vendor_id: 0x02c9
    vendor_part_id: 26428
    hw_ver: 0xB0
    board_id: MT_0D90110009
    phys_port_cnt: 1
    max_mr_size: 0xffffffffffffffff
    page_size_cap: 0xfffffe00
    max_qp: 261056
    max_qp_wr: 16351
    device_cap_flags: 0x007c9c76
    max_sge: 32
    max_sge_rd: 0
    max_cq: 65408
    max_cqe: 4194303
    max_mr: 524272
    max_pd: 32764
    max_qp_rd_atom: 16
    max_ee_rd_atom: 0
    max_res_rd_atom: 4176896
    max_qp_init_rd_atom: 128
    max_ee_init_rd_atom: 0
    atomic_cap: ATOMIC_HCA (1)
    max_ee: 0
    max_rdd: 0
    max_mw: 0
    max_raw_ipv6_qp: 0
    max_raw_ethy_qp: 1
    max_mcast_grp: 8192
    max_mcast_qp_attach: 120
    max_total_mcast_qp_attach: 983040
    max_ah: 0
    max_fmr: 0
    max_srq: 65472
    max_srq_wr: 16383
    max_srq_sge: 31
    max_pkeys: 128
    local_ca_ack_delay: 15
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 2048 (4)
            active_mtu: 2048 (4)
            sm_lid: 465
            port_lid: 35
            port_lmc: 0x00
            link_layer: IB
            max_msg_sz: 0x40000000
            port_cap_flags: 0x02510868
            max_vl_num: 8 (4)
            bad_pkey_cntr: 0x0
            qkey_viol_cntr: 0x0
            sm_sl: 0
            pkey_tbl_len: 128
            gid_tbl_len: 128
            subnet_timeout: 17
            init_type_reply: 0
            active_width: 4X (2)
            active_speed: 10.0 Gbps (4)
            phys_state: LINK_UP (5)
            GID[ 0]: fe80:0000:0000:0000:0002:c903:004b:2171
##################################################################
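Since both adapters report "transport: InfiniBand (0)" and
"link_layer: IB" above, it can help to reduce the two ibv_devinfo dumps to
just the fields that bear on the transport-type check. The following is a
rough sketch only: the function name ibv_summary and the choice of fields
are assumptions made for illustration, not part of any Open MPI or
libibverbs tool.

```shell
# Hypothetical helper: read `ibv_devinfo -v` output on stdin and print
# only the adapter, transport, and link fields as key=value lines.
ibv_summary() {
    awk -F':[ \t]*' '
        # Strip the indentation from the field name.
        { key = $1; gsub(/^[ \t]+/, "", key) }
        key == "hca_id" || key == "transport" || key == "vendor_part_id" ||
        key == "link_layer" || key == "active_speed" {
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim surrounding whitespace
            printf "%s=%s\n", key, val
        }'
}

# Usage on each host (not executed here): ibv_devinfo -v | ibv_summary
```

Running it on each node and comparing the two summaries would show at a
glance which of these fields differ between the hosts.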

-- 
Wesley Emeneker, Research Scientist
The Partnership for an Advanced Computing Environment
Georgia Institute of Technology
404.385.2303
Wesley.Emeneker_at_[hidden]
http://pace.gatech.edu