Open MPI User's Mailing List Archives

Subject: [OMPI users] problems with rankfile in openmpi-1.9a1r29097
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-09-02 10:10:43


Hi,

I installed openmpi-1.9a1r29097 on "openSuSE Linux 12.1", "Solaris 10
x86_64", and "Solaris 10 sparc" with "Sun C 5.12" in 64-bit mode.
Unfortunately I still have problems with rankfiles, which I already
reported in May. The output below is from Linux, but I see the same
problems on all Solaris machines as well.

linpc1 rankfiles 99 cat rf_linpc1
# mpiexec -report-bindings -rf rf_linpc1 hostname
rank 0=linpc1 slot=0:0-1,1:0-1

linpc1 rankfiles 100 mpiexec -report-bindings -rf rf_linpc1 hostname
[linpc1:23413] MCW rank 0 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]]: [B/B][./.]
linpc1

linpc1 rankfiles 101 cat rf_ex_linpc
# mpiexec -report-bindings -rf rf_ex_linpc hostname
rank 0=linpc0 slot=0:0-1,1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1

linpc1 rankfiles 102 mpiexec -report-bindings -rf rf_ex_linpc hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots. Please review your rank-slot
assignments and your host allocation to ensure a proper match. Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: linpc0
--------------------------------------------------------------------------
linpc1 rankfiles 103

I don't have these problems with openmpi-1.6.5a1r28554.

linpc1 rankfiles 95 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.5a1r28554

linpc1 rankfiles 95 mpiexec -report-bindings -rf rf_linpc1 hostname
[linpc1:23583] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc1

linpc1 rankfiles 96 mpiexec -report-bindings -rf rf_ex_linpc hostname
[linpc1:23585] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[linpc1:23585] MCW rank 2 bound to socket 1[core 0]:
  [. .][B .] (slot list 1:0)
[linpc1:23585] MCW rank 3 bound to socket 1[core 1]:
  [. .][. B] (slot list 1:1)
linpc1
linpc1
linpc1
[linpc0:10422] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
  [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
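
To make the slot lists in the rankfiles above explicit, here is a minimal
Python sketch that expands each "rank N=host slot=socket:cores" line into
(socket, core) pairs. It is only an illustration and not how Open MPI
parses rankfiles: the names parse_rankfile and expand_slot_list are
hypothetical, and the sketch assumes the "socket:core" or
"socket:first-last" form used in this post, not the other slot syntaxes
that rankfiles allow.

```python
import re

# Hypothetical helper, not part of Open MPI: expands a slot list such as
# "0:0-1,1:0-1" into explicit (socket, core) pairs.
def expand_slot_list(slot_list):
    pairs = []
    for item in slot_list.split(","):
        socket, cores = item.split(":")
        if "-" in cores:
            first, last = cores.split("-")
        else:
            first = last = cores
        for core in range(int(first), int(last) + 1):
            pairs.append((int(socket), core))
    return pairs

# Assumed line shape, taken from the rankfiles in this post:
#   rank <N>=<host> slot=<socket>:<cores>[,<socket>:<cores>...]
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")

# Hypothetical helper: maps each rank to (host, [(socket, core), ...]).
def parse_rankfile(text):
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        rank, host, slots = match.groups()
        entries[int(rank)] = (host, expand_slot_list(slots))
    return entries
```

Applied to rf_ex_linpc, this yields four cores on two sockets for rank 0
on linpc0, which matches the "[B B][B B]" binding that
openmpi-1.6.5a1r28554 reports.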

I would be grateful if somebody could fix the problem. Thank you
very much in advance for any help.

Kind regards

Siegmar