
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] FW: OMPI v1.6.3 Inconsistent behaviour involving MPI_Comm_connect (can't find route) (UNCLASSIFIED)
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-10-17 11:02:34


I suspect the problem is in Intercomm_merge, as the comment in your file suggests. There were some bug fixes in that code, but they haven't migrated to the 1.7 branch yet (scheduled for 1.7.4).
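
For readers skimming the archive: the attached test programs are not reproduced in this message, but the pattern under discussion looks roughly like the sketch below (an illustrative reconstruction of the accept side, not the actual mpiAccept.cpp; the port-name exchange between the separate mpirun jobs is elided, and names are invented):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char port[MPI_MAX_PORT_NAME];
    if (rank == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        // The real programs must share the port name somehow
        // (e.g. MPI_Publish_name or a file); elided here.
        std::printf("port: %s\n", port);
    }

    MPI_Comm inter, merged;
    // Accept a connection from the second mpirun job, then merge the
    // resulting intercommunicator into a single intracommunicator.
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    MPI_Intercomm_merge(inter, /*high=*/0, &merged);  // suspected trouble spot

    int value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, merged);         // where the hang/error shows up

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&inter);
    if (rank == 0) MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}
```

The connect side would call MPI_Comm_connect with the same port name and merge with high=1; the reported failures appear once three jobs are chained this way.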

On Oct 17, 2013, at 6:56 AM, "Burns, Andrew J CTR (US)" <andrew.j.burns35.ctr_at_[hidden]> wrote:

> Classification: UNCLASSIFIED
> Caveats: NONE
>
> Possibly related to:
> https://svn.open-mpi.org/trac/ompi/ticket/2904
> and
> http://www.open-mpi.org/community/lists/devel/2012/09/11509.php
>
> I am attempting to link communicators from a series of programs together and am running into inconsistent behavior when using
> Open MPI.
>
> Attached is a minimal example that reproduces the issue; the same code executes without issue when using MPICH2.
>
> The attached code is compiled with the commands:
>
> mpicxx mpiAccept.cpp -o acceptTest
> mpicxx mpiConnect.cpp -o connectTest
> mpicxx mpiConnect2.cpp -o connect2Test
>
> I used gcc 4.4.1 and Open MPI 1.6.3.
>
>
> Job file contains the following relevant options:
>
> #!/bin/tcsh
> #PBS -l walltime=00:05:00
> #PBS -l select=3:ncpus=8
>
>
> and executes the program using the following commands:
>
>
> mpirun --tag-output -n 8 ./acceptTest > logConnect1.log &
>
> sleep 5
>
> mpirun --tag-output -n 8 ./connectTest > logConnect2.log &
>
> sleep 5
>
> mpirun --tag-output -n 8 ./connect2Test > logConnect3.log
>
>
> Note that the number of cores is 8; this is a case that executes properly.
>
> However, changing the execution commands to the following:
>
>
> mpirun --tag-output -n 7 ./acceptTest > logConnect1.log &
>
> sleep 5
>
> mpirun --tag-output -n 7 ./connectTest > logConnect2.log &
>
> sleep 5
>
> mpirun --tag-output -n 7 ./connect2Test > logConnect3.log
>
>
> causes errors of the form:
>
> [hostname:31326] [[14363,0],0]:route_callback tried routing message from
> [[14363,1],0] to [[14337,1],2]:102, can't find route
> [0] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_backtrace_print+0x1f) [0x2ad8c884b9ef]
> [1] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_rml_oob.so(+0x26ba) [0x2ad8ca6f26ba]
> [2] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x278)
> [0x2ad8cad1b358]
> [3] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/openmpi/mca_oob_tcp.so(+0x980a) [0x2ad8cad1c80a]
> [4] func:[higher levels stripped]/opmpi/gcc/4.4.1/openmpi-1.6.3/lib/libopen-rte.so.4(opal_event_base_loop+0x238) [0x2ad8c8835888]
> [5] func:mpirun(orterun+0xe80) [0x404bae]
> [6] func:mpirun(main+0x20) [0x403ae4]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2ad8c9797bc6]
> [8] func:mpirun() [0x403a09]
>
> The point of failure seems to be an MPI_Bcast call. Most of the cores make it through the call and show the broadcast value as
> expected. However, several cores in the second and third programs (connectTest and connect2Test) hang at the last broadcast, and
> at least one throws the above error.
>
>
> I have tried several combinations of core amounts and have gotten the following results:
>
> Of the form (# acceptTest cores, # connectTest cores, # connect2Test cores)
>
> Successes:
>
> 1 1 1 across 1:3
> 2 2 2 across 1:6
> 4 4 4 across 2:8
> 8 8 8 across 3:8
> 16 16 16 across 6:8
> 16 4 4 across 3:8
> 16 4 16 across 5:8
> 8 4 4 across 2:8
> 8 7 7 across 3:8
> 8 7 6 across 3:8
> 4 3 2 across 2:8
>
> Failures:
> 3 3 3 across 2:8
> 5 5 5 across 2:8
> 6 6 6 across 3:8
> 7 7 7 across 3:8
> 9 9 9 across 4:8
> 10 10 10 across 4:8
> 11 11 11 across 5:8
> 12 12 12 across 5:8
> 13 13 13 across 5:8
> 14 14 14 across 6:8
> 15 15 15 across 6:8
> 4 4 16 across 3:8
> 4 4 8 across 2:8
>
>
> Other notes:
> In the case of 6 6 6 across 3:8, it is consistently cores 0 and 1 of process 2 and cores 2 and 3 of process 3 that get blocked.
>
> It seems that the first program must have a number of cores that is a power of 2 and at least as large as that of each of the
> other two programs (equal counts, e.g. 8 8 8, succeed).
>
>
> Other versions of OpenMPI:
>
> OpenMPI 1.7.2:
> Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:
>
> [hostname:16109] [[27626,0],0]:route_callback tried routing message from [[27626,1],0] to [[27557,1],0]:30, can't find route
> [0] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_backtrace_print+0x1f) [0x2abd542a876f]
> [1] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_rml_oob.so(+0x25f3) [0x2abd5676f5f3]
> [2] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0)
> [0x2abd5697d040]
> [3] func:[higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb0a7) [0x2abd5697f0a7]
> [4] func:[higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323) [0x2abd542ade63]
> [5] func:mpirun(orterun+0xe3b) [0x404c3f]
> [6] func:mpirun(main+0x20) [0x403bb4]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2abd55406bc6]
> [8] func:mpirun() [0x403ad9]
> [hostname:15968] *** Process received signal ***
> [hostname:15968] Signal: Segmentation fault (11)
> [hostname:15968] Signal code: Address not mapped (1)
> [hostname:15968] Failing at address: 0x6ef34010
> [hostname:15968] [ 0] /lib64/libpthread.so.0(+0xf6b0) [0x2b75859cf6b0]
> [hostname:15968] [ 1] /lib64/libc.so.6(+0x77d0f) [0x2b7585c54d0f]
> [hostname:15968] [ 2] /lib64/libc.so.6(__libc_malloc+0x77) [0x2b7585c572d7]
> [hostname:15968] [ 3] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0x15f)
> [0x2b75871716af]
> [hostname:15968] [ 4] [higher levels stripped]/openmpi-1.7.2built/lib/openmpi/mca_oob_tcp.so(+0xb078) [0x2b7587174078]
> [hostname:15968] [ 5] [higher levels stripped]/openmpi-1.7.2built/lib/libopen-pal.so.5(opal_libevent2019_event_base_loop+0x323)
> [0x2b7584aa2e63]
> [hostname:15968] [ 6] mpirun(orterun+0xe3b) [0x404c3f]
> [hostname:15968] [ 7] mpirun(main+0x20) [0x403bb4]
> [hostname:15968] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2b7585bfbbc6]
> [hostname:15968] [ 9] mpirun() [0x403ad9]
> [hostname:15968] *** End of error message ***
>
>
> OpenMPI 1.7.3rc
> Fails in all cases during MPI_Comm_accept/MPI_Comm_connect with the following error:
>
> [hostname:19222] [[19635,0],0]:route_callback tried routing message from [[19635,1],0] to [[19793,1],0]:30, can't find route
> [0] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_backtrace_print+0x1f) [0x2b43eb07088f]
> [1] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_rml_oob.so(+0x2733) [0x2b43ed55f733]
> [2] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_complete+0x2c0)
> [0x2b43ed76d440]
> [3] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/openmpi/mca_oob_tcp.so(+0xb4a7) [0x2b43ed76f4a7]
> [4] func:[higher levels stripped]/openmpi-1.7.3rc3built/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x88c)
> [0x2b43eb07844c]
> [5] func:mpirun(orterun+0xe25) [0x404c29]
> [6] func:mpirun(main+0x20) [0x403bb4]
> [7] func:/lib64/libc.so.6(__libc_start_main+0xe6) [0x2b43ec1d3bc6]
> [8] func:mpirun() [0x403ad9]
>
>
> Andrew Burns
> Lockheed Martin
> Software Engineer
> 410-306-0409
> andrew.j.burns2_at_[hidden]
> andrew.j.burns35.ctr_at_[hidden]
>
> Classification: UNCLASSIFIED
> Caveats: NONE
>
>
> <test files.zip>