Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem with rankfile in openmpi-1.7.4rc2r30323
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-22 10:38:35


Hard to know how to address all that, Siegmar, but I'll give it a shot. See below.

On Jan 22, 2014, at 5:34 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> yesterday I installed openmpi-1.7.4rc2r30323 on our machines
> ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
> 12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
> contains the following lines.
>
> rank 0=linpc0 slot=0:0-1;1:0-1
> rank 1=linpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=tyr slot=1:0
>
> I get no output when I run the following command.
>
> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
>
> "dbx" reports the following problem.
>
> /opt/solstudio12.3/bin/sparcv9/dbx \
> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message
> 7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> ...
> Reading libmd.so.1
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> (process id 22337)
> Reading libc_psr.so.1
> ...
> Reading mca_dfs_test.so
>
> execution completed, exit code is 1
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
> (process id 22344)
> Reading rtcapihook.so
> ...
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0xffffffff7fffbf8b
> which is 459 bytes above the current stack pointer
> Variable is 'cwd'
> t_at_1 (l_at_1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
> 65 if (0 != strcmp(pwd, cwd)) {
> (dbx) quit
>

This looks like a bogus issue to me. Are you able to run something *without* a rankfile? In other words, is it rankfile operation that is causing a problem, or are you unable to run anything on Sparc?
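
One way to run that check is sketched below. It is a rough illustration, not part of Open MPI: it reads the host names out of a rankfile and builds an mpiexec command that launches on the same hosts without -rf. The helper names (parse_rankfile, mpiexec_without_rankfile) are made up for this example, and it only uses options that already appear in this thread (-report-bindings, -np, --host). It does not reproduce the rankfile's placement; it only shows whether anything launches at all.

```python
def parse_rankfile(text):
    """Return (rank, host) pairs from 'rank N=host slot=...' lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("rank "):
            continue  # skip blank lines and anything that is not a rank entry
        # "rank 0=linpc0 slot=0:0-1;1:0-1" -> rank "0", host "linpc0"
        rank_part, rest = line[len("rank "):].split("=", 1)
        host = rest.split()[0]
        entries.append((int(rank_part), host))
    return entries


def mpiexec_without_rankfile(text, program="hostname"):
    """Build an mpiexec argument list for the same hosts, with no rankfile."""
    entries = parse_rankfile(text)
    hosts = []
    for _, host in entries:
        if host not in hosts:  # keep first-seen order, drop repeats
            hosts.append(host)
    return ["mpiexec", "-report-bindings", "-np", str(len(entries)),
            "--host", ",".join(hosts), program]


# Rankfile contents as quoted earlier in this message.
RANKFILE = """\
rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=tyr slot=1:0
"""

if __name__ == "__main__":
    print(" ".join(mpiexec_without_rankfile(RANKFILE)))
```

If the printed command produces output on the Sparc host while the -rf run does not, that points at the rankfile path rather than at launching in general.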

>
>
>
> Rankfiles work "fine" on x86_64 architectures. Contents of my rankfile.
>
> rank 0=linpc1 slot=0:0-1;1:0-1
> rank 1=sunpc1 slot=0:0-1
> rank 2=sunpc1 slot=1:0
> rank 3=sunpc1 slot=1:1
>
>
> mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> [sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
> socket 0[core 1[hwt 0]]: [B/B][./.]
> [sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> [sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> sunpc1
> sunpc1
> sunpc1
> [linpc1:29997] MCW rank 0 is not bound (or bound to all available
> processors)
> linpc1
>
>
> Unfortunately, "dbx" nevertheless reports a problem.
>
> /opt/solstudio12.3/bin/amd64/dbx \
> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9'
> in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> ...
> Reading libmd.so.1
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> (process id 18330)
> Reading mca_shmem_mmap.so
> ...
> Reading mca_dfs_test.so
> [sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
> socket 0[core 1[hwt 0]]: [B/B][./.]
> [sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
> [sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
> sunpc1
> sunpc1
> sunpc1
> [linpc1:30148] MCW rank 0 is not bound (or bound to all available
> processors)
> linpc1
>
> execution completed, exit code is 0
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
> (process id 18340)
> Reading rtcapihook.so
> ...
>
> RTC: Running program...
> Reading disasm.so
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0x436d57
> which is 15 bytes into a heap block of size 16 bytes at 0x436d48
> This block was allocated from:
> [1] vasprintf() at 0xfffffd7fdc9b335a
> [2] asprintf() at 0xfffffd7fdc9b3452
> [3] opal_output_init() at line 184 in "output.c"
> [4] do_open() at line 548 in "output.c"
> [5] opal_output_open() at line 219 in "output.c"
> [6] opal_malloc_init() at line 68 in "malloc.c"
> [7] opal_init_util() at line 250 in "opal_init.c"
> [8] orterun() at line 658 in "orterun.c"
>
> t_at_1 (l_at_1) stopped in do_open at line 638 in file "output.c"
> 638 info[i].ldi_prefix = strdup(lds->lds_prefix);
> (dbx)
>
>

Again, I think dbx is just getting lost.

>
>
>
> I can also manually bind threads on our Sun M4000 server (two quad-core
> Sparc VII processors with two hwthreads each).
>
> mpiexec --report-bindings -np 4 --bind-to hwthread hostname
> [rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to
> socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
> [rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to
> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> [rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to
> socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
> [rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to
> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> rs0.informatik.hs-fulda.de
> rs0.informatik.hs-fulda.de
> rs0.informatik.hs-fulda.de
> rs0.informatik.hs-fulda.de
>
>
> It doesn't work with cores. I know that it wasn't possible last
> summer and it seems that it is still not possible now.
>
> mpiexec --report-bindings -np 4 --bind-to core hostname
> -----------------------------------------------------------------------
> Open MPI tried to bind a new process, but something went wrong. The
> process was killed without launching the target application. Your job
> will now abort.
>
> Local host: rs0
> Application name: /usr/local/bin/hostname
> Error message: hwloc indicates cpu binding cannot be enforced
> Location:
> ../../../../../openmpi-1.9a1r30290/orte/mca/odls/default/odls_default_module.c:500
> -----------------------------------------------------------------------
> 4 total processes failed to start
>
>
>
> Is it possible to specify hwthreads in a rankfile, so that I can
> use a rankfile for these machines?

Possible, yes. Will it happen in the immediate future? No, I'm afraid I'm swamped right now. However, I'll make a note of it for the future.
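
In the meantime, the --report-bindings output is the way to see which hwthreads a rank actually landed on. The sketch below is only an illustration of how to read the maps quoted in this thread, such as [../B./../..][../../../..]. It assumes the layout visible in those lines: one bracket pair per socket, cores separated by "/", one character per hwthread, "B" for bound. The function name bound_hwthreads is made up for this example and is not an Open MPI API.

```python
import re


def bound_hwthreads(binding_map):
    """Decode a map like '[./.][B/.]' into (socket, core, hwthread) tuples."""
    bound = []
    core_index = 0  # cores are counted across sockets, as in the quoted reports
    sockets = re.findall(r"\[([^\]]*)\]", binding_map)
    for socket_index, socket in enumerate(sockets):
        for core in socket.split("/"):
            # each character of a core's field is one hwthread
            for hwt_index, flag in enumerate(core):
                if flag == "B":
                    bound.append((socket_index, core_index, hwt_index))
            core_index += 1
    return bound


if __name__ == "__main__":
    for example in ("[B/B][./.]", "[./.][B/.]", "[../../../..][../B./../..]"):
        print(example, bound_hwthreads(example))
```

For the M4000 map [../../../..][../B./../..] this yields socket 1, core 5, hwt 0, which agrees with the text Open MPI printed next to that map for rank 3.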

>
> I get the expected output if I use two M4000 servers, although the
> above-mentioned error still exists.
>
>
> /opt/solstudio12.3/bin/sparcv9/dbx \
> /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9'
> in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> ...
> Reading libmd.so.1
> (dbx) run --report-bindings --host rs0,rs1 -np 4 \
> --bind-to hwthread hostname
> Running: mpiexec --report-bindings --host rs0,rs1 -np 4
> --bind-to hwthread hostname
> (process id 9599)
> Reading libc_psr.so.1
> ...
> Reading mca_dfs_test.so
> [rs0.informatik.hs-fulda.de:09599] MCW rank 1 bound to
> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> [rs0.informatik.hs-fulda.de:09599] MCW rank 0 bound to
> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> rs0.informatik.hs-fulda.de
> rs0.informatik.hs-fulda.de
> rs1.informatik.hs-fulda.de
> [rs1.informatik.hs-fulda.de:13398] MCW rank 2 bound to
> socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
> [rs1.informatik.hs-fulda.de:13398] MCW rank 3 bound to
> socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
> rs1.informatik.hs-fulda.de
>
> execution completed, exit code is 0
> (dbx) check -all
> access checking - ON
> memuse checking - ON
> (dbx) run --report-bindings --host rs0,rs1 -np 4 \
> --bind-to hwthread hostname
> Running: mpiexec --report-bindings --host rs0,rs1 -np 4
> --bind-to hwthread hostname
> (process id 9607)
> Reading rtcapihook.so
> ...
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 1 byte at address 0xffffffff7fffc80b
> which is 459 bytes above the current stack pointer
> Variable is 'cwd'
> dbx: warning: can't find file
> ".../openmpi-1.7.4rc2r30323-SunOS.sparc.64_cc/opal/util/../../../
> openmpi-1.7.4rc2r30323/opal/util/opal_getcwd.c"
> dbx: warning: see `help finding-files'
> t_at_1 (l_at_1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
> (dbx)
>
>
> Our M4000 server has no access to the source code, so that it couldn't
> find the file. Nevertheless it is the same error message as above. Is it
> possible that someone solves this problem? Thank you very much for any
> help in advance.
>
>
> Kind regards
>
> Siegmar
>