Open MPI User's Mailing List Archives

Subject: [OMPI users] FW: OMPI v1.6.3 Inconsistent behaviour involving MPI_Comm_connect (can't find route) (UNCLASSIFIED)
From: Burns, Andrew J CTR (US) (andrew.j.burns35.ctr_at_[hidden])
Date: 2013-10-17 09:56:32


Classification: UNCLASSIFIED
Caveats: NONE

Possibly related to:
https://svn.open-mpi.org/trac/ompi/ticket/2904
and
http://www.open-mpi.org/community/lists/devel/2012/09/11509.php

I am attempting to link communicators from a series of programs together and am running into inconsistent behavior when using Open MPI.

Attached is a minimal example of code that will generate this issue; the same code executes without issue under MPICH2.

The attached code is compiled with the commands:

mpicxx mpiAccept.cpp -o acceptTest
mpicxx mpiConnect.cpp -o connectTest
mpicxx mpiConnect2.cpp -o connect2Test

I used gcc 4.4.1 and Open MPI 1.6.3.
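The attachment is not preserved in this archive, so for readers without it, here is a rough sketch of the general shape such an accept/connect pair takes. This is my reconstruction, not the poster's actual code: the file-based port exchange (`port.txt`) is an assumption (the real code may use MPI_Publish_name/MPI_Lookup_name), and the real code presumably also connects connectTest and connect2Test to each other, which is where the routing error between two client jobs points.

```cpp
// acceptTest sketch (assumption: port exchanged via a file named "port.txt")
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char port[MPI_MAX_PORT_NAME];
    if (rank == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        FILE* f = fopen("port.txt", "w");  // hypothetical exchange mechanism
        fprintf(f, "%s", port);
        fclose(f);
    }

    // Accept one connection per client mpirun (connectTest, connect2Test).
    MPI_Comm inter1, inter2;
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter1);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter2);

    // Intercommunicator broadcast: on the root group, rank 0 passes MPI_ROOT
    // and all other ranks pass MPI_PROC_NULL.
    int value = 42;
    MPI_Bcast(&value, 1, MPI_INT, rank == 0 ? MPI_ROOT : MPI_PROC_NULL, inter1);

    MPI_Comm_disconnect(&inter1);
    MPI_Comm_disconnect(&inter2);
    MPI_Finalize();
    return 0;
}
```

```cpp
// connectTest / connect2Test sketch (same assumptions as above)
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char port[MPI_MAX_PORT_NAME];
    if (rank == 0) {
        FILE* f = fopen("port.txt", "r");  // read the server's port name
        fscanf(f, "%s", port);
        fclose(f);
    }

    MPI_Comm inter;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);

    // Receive the broadcast from rank 0 of the accepting side's group.
    int value;
    MPI_Bcast(&value, 1, MPI_INT, 0, inter);

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

The key MPI detail is the asymmetric root argument in the intercommunicator MPI_Bcast: the sending group names the root as MPI_ROOT/MPI_PROC_NULL, while the receiving group names the root's rank in the remote group.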

The job file contains the following relevant options:

#!/bin/tcsh
#PBS -l walltime=00:05:00
#PBS -l select=3:ncpus=8

and executes the program using the following commands:

mpirun --tag-output -n 8 ./acceptTest > logConnect1.log &

sleep 5

mpirun --tag-output -n 8 ./connectTest > logConnect2.log &

sleep 5

mpirun --tag-output -n 8 ./connect2Test > logConnect3.log

Note that the number of cores is 8; this is a case that executes properly.

However, changing the execution commands to the following:

mpirun --tag-output -n 7 ./acceptTest > logConnect1.log &

sleep 5

mpirun --tag-output -n 7 ./connectTest > logConnect2.log &

sleep 5

mpirun --tag-output -n 7 ./connect2Test > logConnect3.log

causes errors of the form:

[hostname:31326] [[14363,0],0]:route_callback tried routing message from
[[14363,1],0] to [[14337,1],2]:102, can't find route
[0] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_backtrace_print+0x1f) [0x2ad8c884b9ef]
[1] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_rml_oob.so(+0x26ba) [0x2ad8ca6f26ba]
[2] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x278)
[0x2ad8cad1b358]
[3] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(+0x980a) [0x2ad8cad1c80a]
[4] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_event_base_loop+0x238) [0x2ad8c8835888]
[5] func:mpirun(orterun+0xe80) [0x404bae]
[6] func:mpirun(main+0x20) [0x403ae4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2ad8c9797bc6]
[8] func:mpirun() [0x403a09]

The point of failure seems to be an MPI_Bcast call. Most of the cores make it through the call and show the broadcast value as appropriate. However, several cores in the second and third processes (connectTest and connect2Test) hang at the last broadcast, and at least one throws the above error.

I have tried several combinations of core counts and have gotten the following results:

Results are listed as (# acceptTest cores, # connectTest cores, # connect2Test cores), where "across N:M" is the PBS allocation select=N:ncpus=M.

Successes:

1 1 1 across 1:3
2 2 2 across 1:6
4 4 4 across 2:8
8 8 8 across 3:8
16 16 16 across 6:8
16 4 4 across 3:8
16 4 16 across 5:8
8 4 4 across 2:8
8 7 7 across 3:8
8 7 6 across 3:8
4 3 2 across 2:8

Failures:
3 3 3 across 2:8
5 5 5 across 2:8
6 6 6 across 3:8
7 7 7 across 3:8
9 9 9 across 4:8
10 10 10 across 4:8
11 11 11 across 5:8
12 12 12 across 5:8
13 13 13 across 5:8
14 14 14 across 6:8
15 15 15 across 6:8
4 4 16 across 3:8
4 4 8 across 2:8

Other notes:
In the case of 6 6 6 across 3:8, it is consistently cores 0 and 1 of process 2 and cores 2 and 3 of process 3 that get blocked.

It seems that the first process must have a number of cores that is a power of 2 and must also have a number of cores greater than the other two processes individually.

Other versions of Open MPI:

Open MPI 1.7.2:
Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:

[hostname:16109] [[27626,0],0]:route_callback tried routing message from [[27626,1],0] to [[27557,1],0]:30, can't find route
[0] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_backtrace_print+0x1f) [0x2abd542a876f]
[1] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_rml_oob.so(+0x25f3) [0x2abd5676f5f3]
[2] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0)
[0x2abd5697d040]
[3] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb0a7) [0x2abd5697f0a7]
[4] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2abd542ade63]
[5] func:mpirun(orterun+0xe3b) [0x404c3f]
[6] func:mpirun(main+0x20) [0x403bb4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2abd55406bc6]
[8] func:mpirun() [0x403ad9]
[hostname:15968] *** Process received signal ***
[hostname:15968] Signal: Segmentation fault (11)
[hostname:15968] Signal code: Address not mapped (1)
[hostname:15968] Failing at address: 0x6ef34010
[hostname:15968] [ 0] /lib64/libpthread.so.0(+0xf6b0) [0x2b75859cf6b0]
[hostname:15968] [ 1] /lib64/libc.so.6(+0x77d0f) [0x2b7585c54d0f]
[hostname:15968] [ 2] /lib64/libc.so.6(__libc_malloc+0x77) [0x2b7585c572d7]
[hostname:15968] [ 3] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0x15f)
[0x2b75871716af]
[hostname:15968] [ 4] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb078) [0x2b7587174078]
[hostname:15968] [ 5] [higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323)
[0x2b7584aa2e63]
[hostname:15968] [ 6] mpirun(orterun+0xe3b) [0x404c3f]
[hostname:15968] [ 7] mpirun(main+0x20) [0x403bb4]
[hostname:15968] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2b7585bfbbc6]
[hostname:15968] [ 9] mpirun() [0x403ad9]
[hostname:15968] *** End of error message ***

Open MPI 1.7.3rc:
Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:

[hostname:19222] [[19635,0],0]:route_callback tried routing message from [[19635,1],0] to [[19793,1],0]:30, can't find route
[0] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_backtrace_print+0x1f) [0x2b43eb07088f]
[1] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_rml_oob.so(+0x2733) [0x2b43ed55f733]
[2] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0)
[0x2b43ed76d440]
[3] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(+0xb4a7) [0x2b43ed76f4a7]
[4] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x88c)
[0x2b43eb07844c]
[5] func:mpirun(orterun+0xe25) [0x404c29]
[6] func:mpirun(main+0x20) [0x403bb4]
[7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2b43ec1d3bc6]
[8] func:mpirun() [0x403ad9]

Andrew Burns
Lockheed Martin
Software Engineer
410-306-0409
andrew.j.burns2_at_[hidden]
andrew.j.burns35.ctr_at_[hidden]

