Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile and openmpi-1.6.2
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-10-03 09:42:04


I filed a bug fix for this one. However, there is something you should note.

If you fail to provide a "-np N" argument to mpiexec, we assume you want ALL available slots filled. The rankfile will contain only those procs that you want specifically bound. The remaining procs will be unbound.

So with your hostfile, we are going to run EIGHT processes, with ranks 0-3 located as specified in the rankfile.

If that isn't what you want, then you should add -np 4 to your command line.
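
The slot arithmetic described above can be checked with a small sketch. This is only an illustration of the counting, not Open MPI's actual mapping code: the function names (total_slots, rankfile_ranks, plan) are hypothetical, and treating a hostfile line without "slots=" as one slot is an assumption made for the example. The hostfile and rankfile contents are copied from the quoted message below.

```python
import re

# Contents copied from the quoted message (host_sunpc0_1 and rankfile).
HOSTFILE = "sunpc0 slots=4\nsunpc1 slots=4\n"
RANKFILE = (
    "rank 0=sunpc0 slot=0:0-1,1:0-1\n"
    "rank 1=sunpc1 slot=0:0-1\n"
    "rank 2=sunpc1 slot=1:0\n"
    "rank 3=sunpc1 slot=1:1\n"
)


def total_slots(hostfile_text):
    """Sum the slots=N values over all non-empty, non-comment hostfile lines."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.search(r"slots=(\d+)", line)
        # Assumption for this sketch: a host without slots= counts as 1 slot.
        total += int(match.group(1)) if match else 1
    return total


def rankfile_ranks(rankfile_text):
    """Return the sorted rank numbers that have an entry in the rankfile."""
    return sorted(
        int(m.group(1))
        for m in re.finditer(r"^rank\s+(\d+)=", rankfile_text, re.MULTILINE)
    )


def plan(hostfile_text, rankfile_text, np=None):
    """Return (process count, ranks bound via rankfile, ranks left unbound)."""
    # Without -np, every available slot is filled.
    nprocs = np if np is not None else total_slots(hostfile_text)
    bound = [r for r in rankfile_ranks(rankfile_text) if r < nprocs]
    unbound = [r for r in range(nprocs) if r not in bound]
    return nprocs, bound, unbound


if __name__ == "__main__":
    print(plan(HOSTFILE, RANKFILE))        # no -np given
    print(plan(HOSTFILE, RANKFILE, np=4))  # with -np 4
```

With no -np, the sketch reports eight processes, with ranks 0-3 taken from the rankfile and ranks 4-7 unbound. With np=4 only the four rankfile ranks remain, which corresponds to a command line such as "mpiexec -np 4 -hostfile host_sunpc0_1 -rf rankfile hostname".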

On Oct 3, 2012, at 3:03 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> I want to test process bindings with a rankfile in openmpi-1.6.2. Both
> machines are dual-processor dual-core machines running Solaris 10 x86_64.
>
> tyr fd1026 138 cat host_sunpc0_1
> sunpc0 slots=4
> sunpc1 slots=4
>
> tyr fd1026 139 cat rankfile
> rank 0=sunpc0 slot=0:0-1,1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
>
> tyr fd1026 140 mpiexec -rf rankfile hostname
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> Is something wrong with my rankfile, must I add a hostfile, or is it a
> bug? I get the following error when I add a hostfile.
>
>
> tyr fd1026 141 mpiexec -hostfile host_sunpc0_1 -rf rankfile hostname
> [tyr.informatik.hs-fulda.de:20227] [[27927,0],0] ORTE_ERROR_LOG:
> Data unpack would read past end of buffer in file
> ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c
> at line 927
> ^Cmpiexec: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
>
> I get the following outputs when I use Linux instead of Solaris
> (same hardware).
>
> tyr fd1026 146 mpiexec -rf rankfile_linux hostname
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> tyr fd1026 147 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
> [tyr.informatik.hs-fulda.de:20260] [[27952,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in
> file ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
> [tyr:20260] *** Process received signal ***
> [tyr:20260] Signal: Bus Error (10)
> [tyr:20260] Signal code: Invalid address alignment (1)
> [tyr:20260] Failing at address: 7463703a2f2f3129
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:opal_backtrace_print+0x14
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:0x335b48
> /lib/sparcv9/libc.so.1:0xd88a4
> /lib/sparcv9/libc.so.1:0xcc418
> /lib/sparcv9/libc.so.1:0xcc624
> /lib/sparcv9/libc.so.1:0x64394 [ Signal 2131043744 (?)]
> /lib/sparcv9/libc.so.1:free+0x30
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:orte_odls_base_default_construct_child_list+0x20b8
> /export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so:0x11c80
> ...
>
> "tyr" is a Sparc machine running Solaris 10. I get a similar error if
> I run the command on a Linux machine.
>
> tyr fd1026 148 ssh linpc4
> linpc4 fd1026 100 mpiexec -rf rankfile_linux hostname
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
>
> linpc4 fd1026 101 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
> [linpc4:08079] [[49559,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file
> ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
> [linpc4:08079] *** Process received signal ***
> [linpc4:08079] Signal: Segmentation fault (11)
> [linpc4:08079] Signal code: Address not mapped (1)
> [linpc4:08079] Failing at address: 0x900306368
> [linpc4:08079] [ 0] /lib64/libpthread.so.0(+0xfd00) [0x7fbe174bcd00]
> [linpc4:08079] [ 1] /lib64/libc.so.6(cfree+0x14) [0x7fbe17197d24]
> [linpc4:08079] [ 2]
> /usr/local/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4(orte_odls_base_default_construct_child_list+0x2091)
> [0x7fbe182e4d21]
> [linpc4:08079] [ 3] /usr/local/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so(+0x10dba) [0x7fbe15415dba]
> ...
>
> Thank you very much for any suggestion in advance.
>
>
> Kind regards
>
> Siegmar
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users