We recently hit a problem running Open MPI 1.5.3, 1.5.4, and 1.6.2 across
compute nodes that have two different generations of InfiniBand adapters
(DDR and QDR).
The error closely resembles one reported to this list in 2011:
http://www.open-mpi.org/community/lists/users/2011/06/16773.php
That report was never resolved on the mailing list.
Here is the error:
#################################################################
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.5.4/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ cat machinefile
iwtf-k43-28
iwm-k43-30
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile ./a.out 0
--------------------------------------------------------------------------
Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.
Local host: iwm-k43-30.pace.gatech.edu
Local adapter: mthca0 (vendor 0x2c9, part ID 25204)
Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
Remote host: iwtf-k43-28
Remote Adapter: (vendor 0x2c9, part ID 26428)
Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
------------------------------------------------------------------------------------------
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
[iwtf-k43-28.pace.gatech.edu:12695] 1 more process has sent help message help-mpi-btl-openib.txt / conflicting transport types
[iwtf-k43-28.pace.gatech.edu:12695] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
----------------------------------------------------------
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
--------------------------------------------------------------------------
Open MPI detected two different OpenFabrics transport types in the same
Infiniband network.
Such mixed network trasport configuration is not supported by Open MPI.
Local host: iwm-k43-30.pace.gatech.edu
Local adapter: mthca0 (vendor 0x2c9, part ID 25204)
Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
Remote host: iwtf-k43-28
Remote Adapter: (vendor 0x2c9, part ID 26428)
Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[34066,1],1]) is on host: iwm-k43-30.pace.gatech.edu
Process 2 ([[34066,1],0]) is on host: iwtf-k43-28
BTLs attempted: self openib
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.
You may wish to try to narrow down the problem;
* Check the output of ompi_info to see which BTL/MTL plugins are
available.
* Run your application with MPI_THREAD_SINGLE.
* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
if using MTL-based communications) to see exactly which
communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[iwm-k43-30.pace.gatech.edu:9131] *** An error occurred in MPI_Init
[iwm-k43-30.pace.gatech.edu:9131] *** on a NULL communicator
[iwm-k43-30.pace.gatech.edu:9131] *** Unknown error
[iwm-k43-30.pace.gatech.edu:9131] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.
Reason: Before MPI_INIT completed
Local host: iwm-k43-30.pace.gatech.edu
PID: 9131
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 9131 on
node iwm-k43-30 exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-btl-openib.txt / conflicting transport types
[iwtf-k43-28.pace.gatech.edu:13279] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[iwtf-k43-28.pace.gatech.edu:13279] 1 more process has sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
#################################################################
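The help text above names two MCA parameters that should give more detail: "orte_base_help_aggregate" (set to 0 to see every help message) and "btl_base_verbose" (set to 100 to see which BTLs were considered or discarded). A minimal sketch of a rerun with both set could look like the following. The helper name diag_cmd is hypothetical, and it only prints the command rather than running it, so the line can be checked before use.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the failing mpirun command with the
# two diagnostic MCA parameters mentioned in the Open MPI help output added.
# First argument is the hostfile; the rest is the program and its arguments.
diag_cmd() {
    hostfile=$1
    shift
    echo mpirun -np 2 -hostfile "$hostfile" \
        --mca btl openib,self \
        --mca orte_base_help_aggregate 0 \
        --mca btl_base_verbose 100 \
        "$@"
}

# Show the command for the test case used in this post.
diag_cmd machinefile ./a.out 0
```

Printing first and running second keeps the sketch safe to paste on a node where mpirun resolves to a different Open MPI installation.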
Open MPI 1.4.3 works as expected on the same two hosts:
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.4.3/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ mpicc testmpi.c
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
Hello from iwtf-k43-28.pace.gatech.edu: 0 of 2
######################################################################
######################################################################
1.5.4 runs fine on iwm-k43-30 by itself:
iwtf-k43-28$ cat machinefile
iwm-k43-30
iwm-k43-30
iwtf-k43-28$ which mpirun
/usr/local/packages/openmpi/1.5.4/gcc-4.4.5/bin/mpirun
iwtf-k43-28$ mpicc testmpi.c
iwtf-k43-28$ mpirun -np 2 -hostfile machinefile --mca btl openib,self ./a.out 0
Hello from iwm-k43-30.pace.gatech.edu: 0 of 2
Hello from iwm-k43-30.pace.gatech.edu: 1 of 2
It fails only when a single job spans both hosts, that is, the DDR node
(iwm-k43-30) together with the QDR node (iwtf-k43-28).
######################################################################
Relevant system information:
- The same error occurs on both RHEL 6.2 and RHEL 6.3.
iwtf-k43-28$ uname -a
Linux iwtf-k43-28.pace.gatech.edu 2.6.32-220.23.1.el6.x86_64 #1 SMP Tue Jun 12 11:20:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
iwm-k43-30$ uname -a
Linux iwm-k43-30.pace.gatech.edu 2.6.32-220.23.1.el6.x86_64 #1 SMP Tue Jun 12 11:20:15 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
# rpm -qa | grep -i verb
libibverbs-debuginfo-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-devel-static-1.1.4-1.24.gb89d4d7.x86_64
libibverbs-devel-1.1.4-1.24.gb89d4d7.x86_64
libipathverbs-1.2-1.x86_64
libipathverbs-debuginfo-1.2-1.x86_64
libibverbs-utils-1.1.4-1.24.gb89d4d7.x86_64
libipathverbs-devel-1.2-1.x86_64
# rpm -qa | grep libmthca
libmthca-1.0.5-0.1.gbe5eef3.x86_64
libmthca-debuginfo-1.0.5-0.1.gbe5eef3.x86_64
libmthca-devel-static-1.0.5-0.1.gbe5eef3.x86_64
# rpm -qa | grep libmlx
libmlx4-devel-1.0.1-1.20.g6771d22.x86_64
libmlx4-debuginfo-1.0.1-1.20.g6771d22.x86_64
libmlx4-1.0.1-1.20.g6771d22.x86_64
1.4.3 "configure" flags: "--with-tm=/opt/torque/2.4.3
--with-io-romio-flags=\"--with-file-system=nfs+ufs+panfs\" --enable-static"
1.5.4 "configure" flags: "--with-tm=/opt/torque/2.4.3
--with-io-romio-flags=\"--with-file-system=nfs+ufs+panfs\"
--with-hwloc=/usr/local/packages/hwloc/1.2/ --enable-static"
1.6.2 "configure" flags: "--with-tm=/opt/torque/2.4.3
--with-io-romio-flags=\"--with-file-system=nfs+ufs+panfs\" --enable-static
--with-knem"
iwm-k43-30# ibv_devinfo -v
hca_id: mthca0
transport: InfiniBand (0)
fw_ver: 1.2.0
node_guid: 0002:c902:0029:8434
sys_image_guid: 0002:c902:0029:8437
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_03B0150002
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffff000
max_qp: 64512
max_qp_wr: 16384
device_cap_flags: 0x00001c76
max_sge: 27
max_sge_rd: 0
max_cq: 65408
max_cqe: 131071
max_mr: 131056
max_pd: 32764
max_qp_rd_atom: 4
max_ee_rd_atom: 0
max_res_rd_atom: 258048
max_qp_init_rd_atom: 128
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 0
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 8192
max_mcast_qp_attach: 56
max_total_mcast_qp_attach: 458752
max_ah: 0
max_fmr: 0
max_srq: 960
max_srq_wr: 16384
max_srq_sge: 27
max_pkeys: 64
local_ca_ack_delay: 15
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 465
port_lid: 54
port_lmc: 0x00
link_layer: IB
max_msg_sz: 0x80000000
port_cap_flags: 0x02510a68
max_vl_num: 4 (3)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 64
gid_tbl_len: 32
subnet_timeout: 17
init_type_reply: 0
active_width: 4X (2)
active_speed: 5.0 Gbps (2)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:0002:c902:0029:8435
##################################################################
##################################################################
iwtf-k43-28# ibv_devinfo -v
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.9.1000
node_guid: 0002:c903:004b:2170
sys_image_guid: 0002:c903:004b:2173
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: MT_0D90110009
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffe00
max_qp: 261056
max_qp_wr: 16351
device_cap_flags: 0x007c9c76
max_sge: 32
max_sge_rd: 0
max_cq: 65408
max_cqe: 4194303
max_mr: 524272
max_pd: 32764
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4176896
max_qp_init_rd_atom: 128
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 0
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 1
max_mcast_grp: 8192
max_mcast_qp_attach: 120
max_total_mcast_qp_attach: 983040
max_ah: 0
max_fmr: 0
max_srq: 65472
max_srq_wr: 16383
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 15
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 465
port_lid: 35
port_lmc: 0x00
link_layer: IB
max_msg_sz: 0x40000000
port_cap_flags: 0x02510868
max_vl_num: 8 (4)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 128
subnet_timeout: 17
init_type_reply: 0
active_width: 4X (2)
active_speed: 10.0 Gbps (4)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:0002:c903:004b:2171
##################################################################
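The two ibv_devinfo dumps above are long, and only a handful of fields differ in ways that matter here (adapter name, part ID, link speed). As a rough sketch, the fields could be pulled out for a side-by-side comparison with a small awk filter. The function name summarize_hca is hypothetical and the choice of fields is an assumption about what is relevant, not something Open MPI itself checks.

```shell
#!/bin/sh
# Hypothetical helper: read `ibv_devinfo -v` text on stdin and print only the
# fields useful for comparing two adapters, one "name=value" pair per line.
summarize_hca() {
    awk '
        $1 == "hca_id:"         { print "hca_id=" $2 }
        $1 == "transport:"      { print "transport=" $2 }
        $1 == "vendor_part_id:" { print "vendor_part_id=" $2 }
        $1 == "link_layer:"     { print "link_layer=" $2 }
        $1 == "active_speed:"   { print "active_speed=" $2 " " $3 }
    '
}

# Usage on each node (assumes ibv_devinfo is on PATH):
#   ibv_devinfo -v | summarize_hca
```

Run on both nodes, the output can be diffed directly. In the dumps above both adapters report transport InfiniBand and link_layer IB, while hca_id, vendor_part_id, and active_speed differ.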
--
Wesley Emeneker, Research Scientist
The Partnership for an Advanced Computing Environment
Georgia Institute of Technology
404.385.2303
Wesley.Emeneker_at_[hidden]
http://pace.gatech.edu