
Open MPI User's Mailing List Archives


Subject: [OMPI users] problem with rankfile in openmpi-1.7.4rc2r30323
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2014-01-22 08:34:45


Hi,

yesterday I installed openmpi-1.7.4rc2r30323 on our machines
("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux
12.1 x86_64" with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr"
contains the following lines.

rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=tyr slot=1:0
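
In other words, the file requests the following placements (the slot value is socket:core(s), and ";" separates several socket ranges). The annotations are only a reading aid; I have not checked whether trailing "#" comments are legal in a rankfile, so don't feed this version to mpiexec verbatim.

```
rank 0=linpc0 slot=0:0-1;1:0-1   # linpc0: socket 0 cores 0-1, socket 1 cores 0-1
rank 1=linpc1 slot=0:0-1         # linpc1: socket 0, cores 0-1
rank 2=sunpc1 slot=1:0           # sunpc1: socket 1, core 0
rank 3=tyr    slot=1:0           # tyr:    socket 1, core 0
```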

I get no output when I run the following command.

mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname

"dbx" reports the following problem.

/opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message
  7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
(process id 22337)
Reading libc_psr.so.1
...
Reading mca_dfs_test.so

execution completed, exit code is 1
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
(process id 22344)
Reading rtcapihook.so
...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0xffffffff7fffbf8b
    which is 459 bytes above the current stack pointer
Variable is 'cwd'
t_at_1 (l_at_1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
   65 if (0 != strcmp(pwd, cwd)) {
(dbx) quit

Rankfiles work "fine" on x86_64 architectures. These are the contents of my rankfile.

rank 0=linpc1 slot=0:0-1;1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
[sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
sunpc1
sunpc1
sunpc1
[linpc1:29997] MCW rank 0 is not bound (or bound to all available
  processors)
linpc1

Unfortunately, "dbx" nevertheless reports a problem.

/opt/solstudio12.3/bin/amd64/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9'
  in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
(process id 18330)
Reading mca_shmem_mmap.so
...
Reading mca_dfs_test.so
[sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
sunpc1
sunpc1
sunpc1
[linpc1:30148] MCW rank 0 is not bound (or bound to all available
  processors)
linpc1

execution completed, exit code is 0
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
(process id 18340)
Reading rtcapihook.so
...

RTC: Running program...
Reading disasm.so
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
        [1] vasprintf() at 0xfffffd7fdc9b335a
        [2] asprintf() at 0xfffffd7fdc9b3452
        [3] opal_output_init() at line 184 in "output.c"
        [4] do_open() at line 548 in "output.c"
        [5] opal_output_open() at line 219 in "output.c"
        [6] opal_malloc_init() at line 68 in "malloc.c"
        [7] opal_init_util() at line 250 in "opal_init.c"
        [8] orterun() at line 658 in "orterun.c"

t_at_1 (l_at_1) stopped in do_open at line 638 in file "output.c"
  638 info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx)


I can also manually bind threads on our Sun M4000 server (two quad-core
SPARC64 VII processors with two hardware threads per core).

mpiexec --report-bindings -np 4 --bind-to hwthread hostname
[rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to
  socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to
  socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to
  socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to
  socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de

Binding to cores doesn't work, however. I know that it wasn't possible
last summer, and it seems that it is still not possible now.

mpiexec --report-bindings -np 4 --bind-to core hostname
-----------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong. The
process was killed without launching the target application. Your job
will now abort.

  Local host: rs0
  Application name: /usr/local/bin/hostname
  Error message: hwloc indicates cpu binding cannot be enforced
  Location:
../../../../../openmpi-1.9a1r30290/orte/mca/odls/default/odls_default_module.c:500
-----------------------------------------------------------------------
4 total processes failed to start

Is it possible to specify hwthreads in a rankfile, so that I can
use a rankfile for these machines?

I get the expected output if I use two M4000 servers, although the
above-mentioned error still exists.

/opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9'
  in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run --report-bindings --host rs0,rs1 -np 4 \
  --bind-to hwthread hostname
Running: mpiexec --report-bindings --host rs0,rs1 -np 4
  --bind-to hwthread hostname
(process id 9599)
Reading libc_psr.so.1
...
Reading mca_dfs_test.so
[rs0.informatik.hs-fulda.de:09599] MCW rank 1 bound to
  socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
[rs0.informatik.hs-fulda.de:09599] MCW rank 0 bound to
  socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs1.informatik.hs-fulda.de
[rs1.informatik.hs-fulda.de:13398] MCW rank 2 bound to
  socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
[rs1.informatik.hs-fulda.de:13398] MCW rank 3 bound to
  socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
rs1.informatik.hs-fulda.de

execution completed, exit code is 0
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run --report-bindings --host rs0,rs1 -np 4 \
  --bind-to hwthread hostname
Running: mpiexec --report-bindings --host rs0,rs1 -np 4
  --bind-to hwthread hostname
(process id 9607)
Reading rtcapihook.so
...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0xffffffff7fffc80b
    which is 459 bytes above the current stack pointer
Variable is 'cwd'
dbx: warning: can't find file
  ".../openmpi-1.7.4rc2r30323-SunOS.sparc.64_cc/opal/util/../../../
  openmpi-1.7.4rc2r30323/opal/util/opal_getcwd.c"
dbx: warning: see `help finding-files'
t_at_1 (l_at_1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
(dbx)

Our M4000 server has no access to the source code, so dbx couldn't find
the file. Nevertheless, it is the same error message as above. Is it
possible for someone to solve this problem? Thank you very much in
advance for any help.

Kind regards

Siegmar