Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] one more problem with process bindings on openmpi-1.6.2
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-10-03 11:40:28


Hi,

> As I said, in the absence of a hostfile, -host assigns ONE slot for
> each time a host is named. So the equivalent hostfile would have
> "slots=1" to create the same pattern as your -host cmd line.
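The rule in the quoted reply can be illustrated with a short sketch. It is not Open MPI code: the function name host_list_to_hostfile is hypothetical, and the sketch only encodes the statement that, in the absence of a hostfile, each mention of a host in the -host list contributes one slot.

```python
def host_list_to_hostfile(host_arg):
    """Return hostfile lines equivalent to a -host argument.

    Assumption (taken from the quoted reply): every time a host is
    named in the comma-separated list, it is given one slot.
    """
    counts = {}  # dicts keep insertion order, so hosts stay in first-seen order
    for name in host_arg.split(","):
        name = name.strip()
        if name:
            counts[name] = counts.get(name, 0) + 1
    return [f"{name} slots={n}" for name, n in counts.items()]


for line in host_list_to_hostfile("sunpc0,sunpc1"):
    print(line)
# prints:
# sunpc0 slots=1
# sunpc1 slots=1
```

Under the same assumption, naming a host twice (for example "sunpc0,sunpc0,sunpc1") would correspond to "sunpc0 slots=2".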

That would mean that a hostfile has nothing to do with the underlying
hardware, and it would be a mystery how to set one up. I have now found
a different solution, so I am reasonably satisfied: I don't need a
different hostfile for every "mpiexec" command. In the transcripts
below I sorted the output and removed the output of "hostname" to make
everything more readable. Is the keyword "sockets" available in
openmpi-1.7 and openmpi-1.9 as well?

tyr fd1026 252 cat host_sunpc0_1
sunpc0 sockets=2 slots=4
sunpc1 sockets=2 slots=4

tyr fd1026 253 mpiexec -report-bindings -hostfile host_sunpc0_1 \
  -np 4 -npersocket 1 -cpus-per-proc 2 -bynode -bind-to-core hostname
[sunpc0:12641] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
[sunpc1:01402] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
[sunpc0:12641] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
[sunpc1:01402] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]

tyr fd1026 254 mpiexec -report-bindings -host sunpc0,sunpc1 \
  -np 4 -cpus-per-proc 2 -bind-to-core -bysocket hostname
[sunpc0:12676] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
[sunpc1:01437] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
[sunpc0:12676] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
[sunpc1:01437] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]

tyr fd1026 258 mpiexec -report-bindings -hostfile host_sunpc0_1 \
  -np 2 -npernode 1 -cpus-per-proc 4 -bind-to-core hostname
[sunpc0:12833] MCW rank 0 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]
[sunpc1:01561] MCW rank 1 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]

tyr fd1026 259 mpiexec -report-bindings -host sunpc0,sunpc1 \
  -np 2 -cpus-per-proc 4 -bind-to-core hostname
[sunpc0:12869] MCW rank 0 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]
[sunpc1:01600] MCW rank 1 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]

Thank you very much for your answers and your time. I have learned
a lot about process bindings through our discussion. Now I'm waiting
for a bug fix for my problem with rankfiles. :-))

Kind regards

Siegmar

> On Oct 3, 2012, at 7:12 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
> > Hi,
> >
> > I thought that a "slot" is the smallest manageable entity, so that I
> > must set "slots=4" for a dual-processor dual-core machine with one
> > hardware thread per core. Today I learned about the new keyword
> > "sockets" for a hostfile (I didn't find it in "man orte_hosts").
> > How would I specify a system with two dual-core processors so that
> > "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4
> > -cpus-per-proc 2 -bind-to-core hostname" or even
> > "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 2
> > -cpus-per-proc 4 -bind-to-core hostname" would work in the same way
> > as the command below?
> >
> > tyr fd1026 217 mpiexec -report-bindings -host sunpc0,sunpc1 -np 2 \
> > -cpus-per-proc 4 -bind-to-core hostname
> > [sunpc0:11658] MCW rank 0 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B]
> > sunpc0
> > [sunpc1:00553] MCW rank 1 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B]
> > sunpc1
> >
> >
> > Thank you very much for your help in advance.
> >
> >
> > Kind regards
> >
> > Siegmar
> >
> >
> >
> >>> I noticed another problem with process bindings. The command
> >>> works if I use "-host" and breaks if I use "-hostfile" with
> >>> the same machines.
> >>>
> >>> tyr fd1026 178 mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
> >>> -cpus-per-proc 2 -bind-to-core hostname
> >>> sunpc1
> >>> [sunpc1:00086] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> >>> [sunpc1:00086] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
> >>> sunpc0
> >>> [sunpc0:10929] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> >>> sunpc0
> >>> [sunpc0:10929] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> >>> sunpc1
> >>>
> >>>
> >>
> >> Yes, this works because you told us there is only ONE slot on each
> >> host. As a result, we split the 4 processes across the two hosts
> >> (both of which are now oversubscribed), resulting in TWO processes
> >> running on each host. Since there are 4 cores on each host, and
> >> you asked for 2 cores/process, we can make this work.
> >>
> >>
> >>> tyr fd1026 179 cat host_sunpc0_1
> >>> sunpc0 slots=4
> >>> sunpc1 slots=4
> >>>
> >>>
> >>> tyr fd1026 180 mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
> >>> -cpus-per-proc 2 -bind-to-core hostname
> >>
> >> And this will of course not work. In your hostfile, you told us there
> >> are FOUR slots on each host. Since the default is to map by slot, we
> >> correctly mapped all four processes to the first node. We then tried
> >> to bind 2 cores for each process, resulting in 8 cores - which is
> >> more than you have.
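The two outcomes discussed in this thread (the -host run that works and the hostfile run that aborts) can be reproduced with a small model. This is a simplified sketch, not Open MPI's mapper: the names map_by_slot and binding_fits are hypothetical, the figure of 4 cores per node comes from the thread, and the wrap-around used once every slot is taken is an assumption inferred from the rank placement shown in the -host transcript.

```python
def map_by_slot(nodes, nprocs):
    """Place ranks node by node, filling each node's slots in turn.

    nodes is a list of (hostname, slots) pairs. If ranks remain after
    every slot is used, start over at the first node (assumed model of
    oversubscription).
    """
    if not nodes or any(slots < 1 for _, slots in nodes):
        raise ValueError("need at least one node, each with slots >= 1")
    placement = {name: [] for name, _ in nodes}
    rank = 0
    while rank < nprocs:
        for name, slots in nodes:
            for _ in range(slots):
                if rank == nprocs:
                    break
                placement[name].append(rank)
                rank += 1
    return placement


def binding_fits(placement, cpus_per_proc, cores_per_node):
    """True if no node is asked to bind more cores than it has."""
    return all(len(ranks) * cpus_per_proc <= cores_per_node
               for ranks in placement.values())


# Hostfile with slots=4: all four ranks land on sunpc0, needing 8 cores.
hostfile_case = map_by_slot([("sunpc0", 4), ("sunpc1", 4)], nprocs=4)
print(hostfile_case, binding_fits(hostfile_case, 2, 4))
# {'sunpc0': [0, 1, 2, 3], 'sunpc1': []} False

# -host sunpc0,sunpc1: one slot each, so ranks alternate between hosts.
host_case = map_by_slot([("sunpc0", 1), ("sunpc1", 1)], nprocs=4)
print(host_case, binding_fits(host_case, 2, 4))
# {'sunpc0': [0, 2], 'sunpc1': [1, 3]} True
```

In the model, only the slot counts differ between the two runs: 4 ranks x 2 cores on one 4-core node cannot fit, while 2 ranks x 2 cores per node can.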
> >>
> >>
> >>> --------------------------------------------------------------------------
> >>> An invalid physical processor ID was returned when attempting to bind
> >>> an MPI process to a unique processor.
> >>>
> >>> This usually means that you requested binding to more processors than
> >>> exist (e.g., trying to bind N MPI processes to M processors, where N >
> >>> M). Double check that you have enough unique processors for all the
> >>> MPI processes that you are launching on this host.
> >>>
> >>> You job will now abort.
> >>> --------------------------------------------------------------------------
> >>> sunpc0
> >>> [sunpc0:10964] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> >>> sunpc0
> >>> [sunpc0:10964] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
> >>> --------------------------------------------------------------------------
> >>> mpiexec was unable to start the specified application as it encountered an error
> >>> on node sunpc0. More information may be available above.
> >>> --------------------------------------------------------------------------
> >>> 4 total processes failed to start
> >>>
> >>>
> >>> Perhaps this error is related to the other errors. Thank you very
> >>> much for any help in advance.
> >>>
> >>>
> >>> Kind regards
> >>>
> >>> Siegmar
> >>>