Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] question to binding options in openmpi-1.6.2
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-10-02 09:48:06


On Oct 2, 2012, at 2:44 AM, Siegmar Gross <Siegmar.Gross_at_[hidden]> wrote:

> Hi,
>
> I tried to reproduce the bindings from the following blog
> http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options
> on a machine with two dual-core processors and openmpi-1.6.2. I have
> ordered the lines and removed the output from "hostname" so that it
> is easier to see the bindings.
>
> mpiexec -report-bindings -host sunpc0 -np 4 -bind-to-socket hostname
> [sunpc0:05410] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:05410] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:05410] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> [sunpc0:05410] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
>
> The output is consistent with the illustration in the above blog.
> Now I add one more machine.
>
> mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
> -bind-to-socket hostname
> [sunpc0:06015] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:25543] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:06015] MCW rank 2 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:25543] MCW rank 3 bound to socket 0[core 0-1]: [B B][. .]
>
> I would have expected the same output as before and not a distribution
> of the processes across both nodes. Did I misunderstand the concept
> so that the output is correct?

The output is correct. The key is your -host specification. In the absence of an allocation or hostfile giving further slot information, listing each host once indicates there is one slot on each host. Oversubscription is allowed by default; otherwise the job would have aborted with an "insufficient slots" error. Instead, the mapper places rank 0 on the first node, which "fills" its single slot. We therefore move to the next node and "fill" it with rank 1. Since both nodes are now "oversubscribed", the remaining ranks are simply balanced round-robin across the available nodes.
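The slot-filling behavior described above can be sketched in a few lines of Python (a simplified illustration of the default by-slot mapping, not ORTE's actual rmaps code):

```python
def map_by_slot(hosts, slots, nprocs):
    """Sketch of the default by-slot mapping with oversubscription:
    fill each host's declared slots in order, then balance any
    remaining ranks round-robin across all hosts."""
    placement = []
    # First pass: consume each host's declared slots in order.
    for host, nslots in zip(hosts, slots):
        for _ in range(nslots):
            if len(placement) < nprocs:
                placement.append(host)
    # All slots full -> oversubscribe, balancing leftover ranks.
    i = 0
    while len(placement) < nprocs:
        placement.append(hosts[i % len(hosts)])
        i += 1
    return placement

# "-host sunpc0,sunpc1" implies one slot per host, so with -np 4 the
# ranks land as in the output above: 0 and 2 on sunpc0, 1 and 3 on sunpc1.
print(map_by_slot(["sunpc0", "sunpc1"], [1, 1], 4))
```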

> When I try "-bysocket" with one
> machine, I get once more a consistent output to the above blog.
>
> mpiexec -report-bindings -host sunpc0 -np 4 -bysocket \
> -bind-to-socket hostname
> [sunpc0:05451] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:05451] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
> [sunpc0:05451] MCW rank 2 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:05451] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
>
> However I get once more an unexpected output when I add one more
> machine and not the expected output from above.
>
> mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 -bysocket \
> -bind-to-socket hostname
> [sunpc0:06130] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:25660] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:06130] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> [sunpc1:25660] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]

Same reason as above: the one-slot-per-host allocation determines the node placement, and -bysocket then only controls which socket each node's local ranks bind to.
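With the node placement fixed by the slot allocation, -bysocket only orders each node's local ranks across its sockets. A hypothetical sketch (an assumed simplification, not Open MPI's actual binding code):

```python
def bysocket_binding(local_ranks, nsockets=2):
    """Round-robin a node's local ranks across its sockets,
    as the -bysocket policy does within a single node."""
    return {rank: i % nsockets for i, rank in enumerate(sorted(local_ranks))}

# sunpc0 holds local ranks 0 and 2 (see the output above); they bind to
# socket 0 and socket 1 respectively, matching the report-bindings lines.
print(bysocket_binding([0, 2]))
```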

>
> I would have expected a distribution of the processes across all
> nodes, if I would have used "-bynode" (as in the following example).
>
> mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 -bynode \
> -bind-to-socket hostname
> [sunpc0:06171] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:25696] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:06171] MCW rank 2 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:25696] MCW rank 3 bound to socket 0[core 0-1]: [B B][. .]
>
>
> Option "-npersocket" doesn't work, even if I reduce "-npersocket"
> to "1". Why doesn't it find any sockets, although the above commands
> could find both sockets?
>
> mpiexec -report-bindings -host sunpc0 -np 2 -npersocket 1 hostname
> --------------------------------------------------------------------------
> Your job has requested a conflicting number of processes for the
> application:
>
> App: hostname
> number of procs: 2
>
> This is more processes than we can launch under the following
> additional directives and conditions:
>
> number of sockets: 0
> npersocket: 1
>
> Please revise the conflict and try again.
> --------------------------------------------------------------------------

No idea - will have to look at the code to find the bug.

>
>
> By the way I get the same output if I use Linux instead of Solaris.
> I would be grateful if somebody could clarify if I misunderstood the
> binding concept or if the binding is wrong if I use more than one
> machine. Thank you very much for any comments in advance.
>
>
> Kind regards
>
> Siegmar
>