Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile and openmpi-1.6.2
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-10-03 08:54:23


I saw your earlier note about this too. Just a little busy right now, but hope to look at it soon.

Your rankfile looks fine, so undoubtedly a bug has crept into this rarely-used code path.
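
For anyone who wants to double-check a rankfile like the one quoted below independently of mpiexec, here is a minimal Python sketch. It is not Open MPI's parser: the helper names (parse_rankfile, cores_per_host) are made up for illustration, and it only understands the "slot=socket:core" form used in this thread, not the other slot syntaxes Open MPI accepts. It expands each rank's slot list and counts the distinct cores requested per host, which can then be compared by hand against the slots=4 entries in the hostfile.

```python
import re

# Hypothetical helper, not part of Open MPI. Handles only lines of the form
#   rank <N>=<host> slot=<socket>:<cores>[,<socket>:<cores>...]
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def expand_cores(spec):
    """Expand a core spec such as '0-1' to [0, 1] or '1' to [1]."""
    if "-" in spec:
        lo, hi = spec.split("-")
        return list(range(int(lo), int(hi) + 1))
    return [int(spec)]


def parse_rankfile(text):
    """Return {rank: (host, [(socket, core), ...])} for the given rankfile text."""
    ranks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = RANK_LINE.match(line)
        if m is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        rank, host, slots = int(m.group(1)), m.group(2), m.group(3)
        pairs = []
        for part in slots.split(","):
            socket, cores = part.split(":")
            pairs.extend((int(socket), core) for core in expand_cores(cores))
        ranks[rank] = (host, pairs)
    return ranks


def cores_per_host(ranks):
    """Count the distinct (socket, core) pairs requested on each host."""
    used = {}
    for host, pairs in ranks.values():
        used.setdefault(host, set()).update(pairs)
    return {host: len(cores) for host, cores in used.items()}


# The rankfile from the quoted message.
RANKFILE = """\
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1
"""

if __name__ == "__main__":
    parsed = parse_rankfile(RANKFILE)
    for rank, (host, pairs) in sorted(parsed.items()):
        print(rank, host, pairs)
    print(cores_per_host(parsed))
```

With the rankfile from the quoted message this reports four distinct cores on each of sunpc0 and sunpc1, which does not exceed the four slots per host in the hostfile. That only shows the file is self-consistent under the assumptions above; it says nothing about what the 1.6.2 mapper does with it.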

On Oct 3, 2012, at 3:03 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> I want to test process bindings with a rankfile in openmpi-1.6.2. Both
> machines are dual-processor dual-core machines running Solaris 10 x86_64.
>
> tyr fd1026 138 cat host_sunpc0_1
> sunpc0 slots=4
> sunpc1 slots=4
>
> tyr fd1026 139 cat rankfile
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
>
> tyr fd1026 140 mpiexec -rf rankfile hostname
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> Is something wrong with my rankfile, must I add a hostfile, or is it a
> bug? I get the following error when I add a hostfile.
>
>
> tyr fd1026 141 mpiexec -hostfile host_sunpc0_1 -rf rankfile hostname
> [tyr.informatik.hs-fulda.de:20227] [[27927,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
> ^Cmpiexec: abort is already in progress...hit ctrl-c again to forcibly terminate
>
>
> I get the following outputs when I use Linux instead of Solaris
> (same hardware).
>
> tyr fd1026 146 mpiexec -rf rankfile_linux hostname
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> tyr fd1026 147 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
> [tyr.informatik.hs-fulda.de:20260] [[27952,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
> [tyr:20260] *** Process received signal ***
> [tyr:20260] Signal: Bus Error (10)
> [tyr:20260] Signal code: Invalid address alignment (1)
> [tyr:20260] Failing at address: 7463703a2f2f3129
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:opal_backtrace_print+0x14
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:0x335b48
> /lib/sparcv9/libc.so.1:0xd88a4
> /lib/sparcv9/libc.so.1:0xcc418
> /lib/sparcv9/libc.so.1:0xcc624
> /lib/sparcv9/libc.so.1:0x64394 [ Signal 2131043744 (?)]
> /lib/sparcv9/libc.so.1:free+0x30
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:orte_odls_base_default_construct_child_list+0x20b8
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so:0x11c80
> ...
>
> "tyr" is a Sparc machine running Solaris 10. I get a similar error if
> I run the command on a Linux machine.
>
> tyr fd1026 148 ssh linpc4
> linpc4 fd1026 100 mpiexec -rf rankfile_linux hostname
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> linpc4 fd1026 101 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
> [linpc4:08079] [[49559,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
> [linpc4:08079] *** Process received signal ***
> [linpc4:08079] Signal: Segmentation fault (11)
> [linpc4:08079] Signal code: Address not mapped (1)
> [linpc4:08079] Failing at address: 0x900306368
> [linpc4:08079] [ 0] /lib64/libpthread.so.0(+0xfd00) [0x7fbe174bcd00]
> [linpc4:08079] [ 1] /lib64/libc.so.6(cfree+0x14) [0x7fbe17197d24]
> [linpc4:08079] [ 2] /usr/local/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4(orte_odls_base_default_construct_child_list+0x2091) [0x7fbe182e4d21]
> [linpc4:08079] [ 3] /usr/local/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so(+0x10dba) [0x7fbe15415dba]
> ...
>
> Thank you very much in advance for any suggestions.
>
>
> Kind regards
>
> Siegmar
>