
Subject: Re: [OMPI users] one more problem with process bindings on openmpi-1.6.2
From: Siegmar Gross (Siegmar.Gross_at_[hidden])
Date: 2012-10-03 11:40:28


Hi,

> As I said, in the absence of a hostfile, -host assigns ONE slot for
> each time a host is named. So the equivalent hostfile would have
> "slots=1" to create the same pattern as your -host cmd line.

That would mean that a hostfile has nothing to do with the underlying
hardware, and it would be hard to figure out how to set one up. In the
meantime I have found a different solution, so I'm at least a little bit
satisfied that I don't need a different hostfile for every "mpiexec"
command. I have sorted the output and removed the output of "hostname"
so that everything is more readable. Is the keyword "sockets" available
in openmpi-1.7 and openmpi-1.9 as well?

tyr fd1026 252 cat host_sunpc0_1
sunpc0 sockets=2 slots=4
sunpc1 sockets=2 slots=4

tyr fd1026 253 mpiexec -report-bindings -hostfile host_sunpc0_1 \
  -np 4 -npersocket 1 -cpus-per-proc 2 -bynode -bind-to-core hostname
[sunpc0:12641] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
[sunpc1:01402] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
[sunpc0:12641] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
[sunpc1:01402] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]

tyr fd1026 254 mpiexec -report-bindings -host sunpc0,sunpc1 \
  -np 4 -cpus-per-proc 2 -bind-to-core -bysocket hostname
[sunpc0:12676] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
[sunpc1:01437] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
[sunpc0:12676] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
[sunpc1:01437] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]

tyr fd1026 258 mpiexec -report-bindings -hostfile host_sunpc0_1 \
  -np 2 -npernode 1 -cpus-per-proc 4 -bind-to-core hostname
[sunpc0:12833] MCW rank 0 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]
[sunpc1:01561] MCW rank 1 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]

tyr fd1026 259 mpiexec -report-bindings -host sunpc0,sunpc1 \
  -np 2 -cpus-per-proc 4 -bind-to-core hostname
[sunpc0:12869] MCW rank 0 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]
[sunpc1:01600] MCW rank 1 bound to socket 0[core 0-1]
                                   socket 1[core 0-1]: [B B][B B]

Thank you very much for your answers and your time. I have learned
a lot about process bindings through our discussion. Now I'm waiting
for a bug fix for my problem with rankfiles. :-))

Kind regards

Siegmar

> On Oct 3, 2012, at 7:12 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:
>
> > Hi,
> >
> > I thought that "slot" is the smallest manageable entity so that I
> > must set "slot=4" for a dual-processor dual-core machine with one
> > hardware-thread per core. Today I learned about the new keyword
> > "sockets" for a hostfile (I didn't find it in "man orte_hosts").
> > How would I specify a system with two dual-core processors so that
> > "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4
> > -cpus-per-proc 2 -bind-to-core hostname" or even
> > "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 2
> > -cpus-per-proc 4 -bind-to-core hostname" would work in the same way
> > as the commands below?
> >
> > tyr fd1026 217 mpiexec -report-bindings -host sunpc0,sunpc1 -np 2 \
> > -cpus-per-proc 4 -bind-to-core hostname
> > [sunpc0:11658] MCW rank 0 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B]
> > sunpc0
> > [sunpc1:00553] MCW rank 1 bound to socket 0[core 0-1]
> > socket 1[core 0-1]: [B B][B B]
> > sunpc1
> >
> >
> > Thank you very much for your help in advance.
> >
> >
> > Kind regards
> >
> > Siegmar
> >
> >
> >
> >>> I recognized another problem with process bindings. The command
> >>> works, if I use "-host" and it breaks, if I use "-hostfile" with
> >>> the same machines.
> >>>
> >>> tyr fd1026 178 mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
> >>> -cpus-per-proc 2 -bind-to-core hostname
> >>> sunpc1
> >>> [sunpc1:00086] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> >>> [sunpc1:00086] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
> >>> sunpc0
> >>> [sunpc0:10929] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> >>> sunpc0
> >>> [sunpc0:10929] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> >>> sunpc1
> >>>
> >>>
> >>
> >> Yes, this works because you told us there is only ONE slot on each
> >> host. As a result, we split the 4 processes across the two hosts
> >> (both of which are now oversubscribed), resulting in TWO processes
> >> running on each host. Since there are 4 cores on each host, and
> >> you asked for 2 cores/process, we can make this work.
> >>
> >>
> >>> tyr fd1026 179 cat host_sunpc0_1
> >>> sunpc0 slots=4
> >>> sunpc1 slots=4
> >>>
> >>>
> >>> tyr fd1026 180 mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
> >>> -cpus-per-proc 2 -bind-to-core hostname
> >>
> >> And this will of course not work. In your hostfile, you told us there
> >> are FOUR slots on each host. Since the default is to map by slot, we
> >> correctly mapped all four processes to the first node. We then tried
> >> to bind 2 cores for each process, resulting in 8 cores - which is
> >> more than you have.
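
Following that reasoning, mapping by node instead of by slot should
avoid the over-binding, because then only two processes with two cores
each are placed on every host, which matches the four available cores.
An untested sketch along the lines of the working commands at the top
of this mail:

  mpiexec -report-bindings -hostfile host_sunpc0_1 \
    -np 4 -cpus-per-proc 2 -bynode -bind-to-core hostname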
> >>
> >>
> >>> --------------------------------------------------------------------------
> >>> An invalid physical processor ID was returned when attempting to bind
> >>> an MPI process to a unique processor.
> >>>
> >>> This usually means that you requested binding to more processors than
> >>> exist (e.g., trying to bind N MPI processes to M processors, where N >
> >>> M). Double check that you have enough unique processors for all the
> >>> MPI processes that you are launching on this host.
> >>>
> >>> Your job will now abort.
> >>> --------------------------------------------------------------------------
> >>> sunpc0
> >>> [sunpc0:10964] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> >>> sunpc0
> >>> [sunpc0:10964] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
> >>> --------------------------------------------------------------------------
> >>> mpiexec was unable to start the specified application as it encountered
> >>> an error
> >>> on node sunpc0. More information may be available above.
> >>> --------------------------------------------------------------------------
> >>> 4 total processes failed to start
> >>>
> >>>
> >>> Perhaps this error is related to the other errors. Thank you very
> >>> much for any help in advance.
> >>>
> >>>
> >>> Kind regards
> >>>
> >>> Siegmar
> >>>