Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problems with rankfile in openmpi-1.9a1r29097
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2013-09-03 04:05:17


Hi,

> Okay, I have a fix for not specifying the number of procs when
> using a rankfile.
>
> As for the binding pattern, the problem is a syntax error in
> your rankfile. You need a semi-colon instead of a comma to
> separate the sockets for rank0:
>
> > rank 0=bend001 slot=0:0-1,1:0-1 => rank 0=bend001 slot=0:0-1;1:0-1
>
> This is required because you use commas to list specific cores
> - e.g., slot=0:0,1,4,6
...

OK, so the syntax has changed: Open MPI 1.6.x needs "," and Open MPI
1.9.x needs ";" to separate sockets. Unfortunately my rankfiles still
don't work as expected, even when I add "-np <number>" so that
everything is now specified. These are some of my rankfiles, which I
use below to demonstrate the different errors.
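
For reference, your quoted example line in both syntaxes:

# Open MPI 1.6.x: "," separates the sockets as well as the cores
rank 0=bend001 slot=0:0-1,1:0-1
# Open MPI 1.7.x and newer: ";" separates the sockets, "," lists cores
rank 0=bend001 slot=0:0-1;1:0-1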

::::::::::::::
rf_linpc_semicolon
::::::::::::::
# Open MPI 1.7.x and newer needs ";" to separate sockets.
# mpiexec -report-bindings -rf rf_linpc_semicolon -np 1 hostname
rank 0=linpc1 slot=0:0-1;1:0-1

::::::::::::::
rf_linpc_linpc_semicolon
::::::::::::::
# Open MPI 1.7.x and newer needs ";" to separate sockets.
# mpiexec -report-bindings -rf rf_linpc_linpc_semicolon -np 4 hostname
rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1

::::::::::::::
rf_tyr_semicolon
::::::::::::::
# Open MPI 1.7.x and newer needs ";" to separate sockets.
# mpiexec -report-bindings -rf rf_tyr_semicolon -np 1 hostname
rank 0=tyr slot=0:0;1:0
tyr rankfiles 198
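
(The corresponding "*_comma" rankfiles used with Open MPI 1.6.x below
are not listed here; they contain the same assignments in the old
comma syntax, and the slot lists are echoed in the 1.6.x
-report-bindings output. rf_tyr_comma, for example:)

::::::::::::::
rf_tyr_comma
::::::::::::::
# Open MPI 1.6.x needs "," to separate sockets.
# mpiexec -report-bindings -rf rf_tyr_comma -np 1 hostname
rank 0=tyr slot=0:0,1:0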

These are my results. The "linpc?" machines run openSuSE Linux, the
"sunpc?" machines run Solaris 10 x86_64, and "tyr" runs Solaris 10
sparc. "linpc?" and "sunpc?" use identical hardware.

tyr rankfiles 107 ompi_info | grep "Open MPI:"
                Open MPI: 1.9a1r29097

1) It seems that I can use a rankfile only on a node that is itself
   listed in the rankfile.

linpc1 rankfiles 98 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
[linpc1:12504] MCW rank 0 bound to socket 0[core 0[hwt 0]],
  socket 0[core 1[hwt 0]], socket 1[core 2[hwt 0]],
  socket 1[core 3[hwt 0]]: [B/B][B/B]
linpc1
linpc1 rankfiles 98 exit

tyr rankfiles 125 ssh sunpc1
...
sunpc1 rankfiles 102 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
sunpc1 rankfiles 103 exit

linpc0 rankfiles 93 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
linpc0 rankfiles 94 exit

With Open MPI 1.6.x I can use the rankfile from any machine.

tyr rankfiles 105 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.5a1r28554

tyr rankfiles 106 mpiexec -report-bindings \
  -rf rf_linpc_semicolon -np 1 hostname
[tyr.informatik.hs-fulda.de:29380] Got an error!
[linpc1:12637] MCW rank 0 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
linpc1

The semicolon isn't accepted by Open MPI 1.6.x.

tyr rankfiles 107 mpiexec -report-bindings \
  -rf rf_linpc_comma -np 1 hostname
[linpc1:12704] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0,1,1:0,1)
linpc1
tyr rankfiles 108

2) With Open MPI 1.9.x I cannot use a rankfile that spans two Linux
   machines.

linpc1 rankfiles 105 mpiexec -report-bindings \
  -rf rf_linpc_linpc_semicolon -np 4 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots. Please review your rank-slot
assignments and your host allocation to ensure a proper match. Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: linpc0
--------------------------------------------------------------------------
linpc1 rankfiles 106

Perhaps this problem is a consequence of the problem described above.

There is no problem with Open MPI 1.6.x.
   
linpc1 rankfiles 106 mpiexec -report-bindings \
  -rf rf_linpc_linpc_comma -np 4 hostname
[linpc1:12975] MCW rank 1 bound to socket 0[core 0-1]:
  [B B][. .] (slot list 0:0-1)
[linpc1:12975] MCW rank 2 bound to socket 1[core 0]:
  [. .][B .] (slot list 1:0)
[linpc1:12975] MCW rank 3 bound to socket 1[core 1]:
  [. .][. B] (slot list 1:1)
linpc1
linpc1
[linpc0:13855] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
linpc1 rankfiles 107

3) I have a problem on "tyr" (Solaris 10 sparc).

tyr rankfiles 106 mpiexec -report-bindings \
  -rf rf_tyr_semicolon -np 1 hostname
[tyr.informatik.hs-fulda.de:29849] [[53951,0],0] ORTE_ERROR_LOG:
  Not found in file
   ../../../../../openmpi-1.9a1r29097/orte/mca/rmaps/rank_file/rmaps_rank_file.c
   at line 276
[tyr.informatik.hs-fulda.de:29849] [[53951,0],0] ORTE_ERROR_LOG:
  Not found in file
  ../../../../openmpi-1.9a1r29097/orte/mca/rmaps/base/rmaps_base_map_job.c
  at line 173
tyr rankfiles 107

I get the following output if I try the rankfile from a different
machine (also Solaris 10 sparc).

rs0 rankfiles 104 mpiexec -report-bindings -rf rf_tyr_semicolon -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
rs0 rankfiles 105

This time I also have a small problem with Open MPI 1.6.x.

tyr rankfiles 134 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.5a1r28554

tyr rankfiles 135 mpiexec -report-bindings \
  -rf rf_tyr_comma -np 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr rankfiles 136 ssh rs0
...
rs0 rankfiles 104 ompi_info | grep "Open MPI:"
                Open MPI: 1.6.5a1r28554

rs0 rankfiles 105 mpiexec -report-bindings \
  -rf rf_tyr_comma -np 1 hostname
[tyr.informatik.hs-fulda.de:29770] MCW rank 0 bound to
  socket 0[core 0] socket 1[core 0]: [B][B] (slot list 0:0,1:0)
tyr.informatik.hs-fulda.de
rs0 rankfiles 106

Why doesn't it work when I'm logged in on the machine named in the
rankfile, while it works when I use the rankfile from a different
machine? Thank you very much in advance for any help.

Kind regards

Siegmar

> On Sep 2, 2013, at 7:52 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> > It seems to run for me on CentOS, though I note that rank 0 isn't
> > bound to both sockets 0 and 1 as specified and I had to tell it how
> > many procs to run:
> >
> > [rhc_at_bend001 svn-trunk]$ mpirun --report-bindings
> > -rf rf -n 4 hostname
> > [bend001:13297] MCW rank 0 bound to socket 0[core 0[hwt 0-1]],
> > socket 0[core 1[hwt 0-1]]: [BB/BB/../../../..][../../../../../..]
> > bend001
> > [bend002:25899] MCW rank 3 bound to socket 1[core 7[hwt 0-1]]:
> > [../../../../../..][../BB/../../../..]
> > bend002
> > [bend002:25899] MCW rank 1 bound to socket 0[core 0[hwt 0-1]],
> > socket 0[core 1[hwt 0-1]]: [BB/BB/../../../..][../../../../../..]
> > bend002
> > [bend002:25899] MCW rank 2 bound to socket 1[core 6[hwt 0-1]]:
> > [../../../../../..][BB/../../../../..]
> > bend002
> >
> > [rhc_at_bend001 svn-trunk]$ cat rf
> > rank 0=bend001 slot=0:0-1,1:0-1
> > rank 1=bend002 slot=0:0-1
> > rank 2=bend002 slot=1:0
> > rank 3=bend002 slot=1:1
> >
> > I'll work on those issues, but don't know why you are getting
> > this "not allocated" error.
> >
> >
> > On Sep 2, 2013, at 7:10 AM, Siegmar Gross
> > <Siegmar.Gross_at_[hidden]> wrote:
> >
> >> Hi,
> >>
> >> I installed openmpi-1.9a1r29097 on "openSuSE Linux 12.1", "Solaris 10
> >> x86_64", and "Solaris 10 sparc" with "Sun C 5.12" in 64-bit mode.
> >> Unfortunately I still have a problem with rankfiles. I reported the
> >> problems already in May. I show the problems with Linux, although I
> >> have the same problems on all Solaris machines as well.
> >>
> >> linpc1 rankfiles 99 cat rf_linpc1
> >> # mpiexec -report-bindings -rf rf_linpc1 hostname
> >> rank 0=linpc1 slot=0:0-1,1:0-1
> >>
> >> linpc1 rankfiles 100 mpiexec -report-bindings -rf rf_linpc1 hostname
> >> [linpc1:23413] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> >> socket 0[core 1[hwt 0]]: [B/B][./.]
> >> linpc1
> >>
> >>
> >> linpc1 rankfiles 101 cat rf_ex_linpc
> >> # mpiexec -report-bindings -rf rf_ex_linpc hostname
> >> rank 0=linpc0 slot=0:0-1,1:0-1
> >> rank 1=linpc1 slot=0:0-1
> >> rank 2=linpc1 slot=1:0
> >> rank 3=linpc1 slot=1:1
> >>
> >> linpc1 rankfiles 102 mpiexec -report-bindings -rf rf_ex_linpc hostname
> >> --------------------------------------------------------------------------
> >> The rankfile that was used claimed that a host was either not
> >> allocated or oversubscribed its slots. Please review your rank-slot
> >> assignments and your host allocation to ensure a proper match. Also,
> >> some systems may require using full hostnames, such as
> >> "host1.example.com" (instead of just plain "host1").
> >>
> >> Host: linpc0
> >> --------------------------------------------------------------------------
> >> linpc1 rankfiles 103
> >>
> >>
> >>
> >> I don't have these problems with openmpi-1.6.5a1r28554.
> >>
> >> linpc1 rankfiles 95 ompi_info | grep "Open MPI:"
> >> Open MPI: 1.6.5a1r28554
> >>
> >> linpc1 rankfiles 95 mpiexec -report-bindings -rf rf_linpc1 hostname
> >> [linpc1:23583] MCW rank 0 bound to socket 0[core 0-1]
> >> socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
> >> linpc1
> >>
> >>
> >> linpc1 rankfiles 96 mpiexec -report-bindings -rf rf_ex_linpc hostname
> >> [linpc1:23585] MCW rank 1 bound to socket 0[core 0-1]:
> >> [B B][. .] (slot list 0:0-1)
> >> [linpc1:23585] MCW rank 2 bound to socket 1[core 0]:
> >> [. .][B .] (slot list 1:0)
> >> [linpc1:23585] MCW rank 3 bound to socket 1[core 1]:
> >> [. .][. B] (slot list 1:1)
> >> linpc1
> >> linpc1
> >> linpc1
> >> [linpc0:10422] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]:
> >> [B B][B B] (slot list 0:0-1,1:0-1)
> >> linpc0
> >>
> >>
> >> I would be grateful, if somebody can fix the problem. Thank you
> >> very much for any help in advance.
> >>
> >>
> >> Kind regards
> >>
> >> Siegmar
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
>