Open MPI User's Mailing List Archives

Subject: [OMPI users] Problem with mpiexec --cpus-per-proc in multiple nodes in OMPI 1.6.4
From: Gus Correa (gus_at_[hidden])
Date: 2013-03-21 15:01:00


Dear Open MPI Pros

I am having trouble using mpiexec with --cpus-per-proc
on multiple nodes in OMPI 1.6.4.

I know there is an ongoing thread on similar runtime issues
in OMPI 1.7.
By no means am I trying to hijack T. Mishima's questions.
My question is genuine, though, and perhaps related to his.

I am testing a new cluster remotely, with monster
dual socket 16-core AMD Bulldozer processors (32 cores per node).
I am using OMPI 1.6.4 built with Torque 4.2.1 support.

I read that on these processors each pair of cores shares an FPU.
Hence, I am trying to run *one MPI process* on each
*pair of successive cores*.
This trick seems to yield better performance
(at least for HPL/Linpack) than using all cores.
I.e., the goal is to use "every other core", or perhaps
to allow each process to wobble across two successive cores only,
hence granting exclusive use of one FPU per process.
[BTW, this is *not* an attempt to do hybrid MPI+OpenMP.
The code is HPL with MPI+BLAS/Lapack and NO OpenMP.]

To achieve this, I am using the mpiexec --cpus-per-proc option.
It works on one node, which is great.
However, unless I made a silly syntax or arithmetic mistake,
it doesn't seem to work on more than one node.

For instance, this works:

#PBS -l nodes=1:ppn=32
...
mpiexec -np 16 \
     --cpus-per-proc 2 \
     --bind-to-core \
     --report-bindings \
     --tag-output \

I get a pretty nice process-to-cores distribution, with 16 processes,
and each process bound to a pair of successive cores,
as expected:

[1,7]<stderr>:[node33:04744] MCW rank 7 bound to socket 0[core 14-15]: [. . . . . . . . . . . . . . B B][. . . . . . . . . . . . . . . .]
[1,8]<stderr>:[node33:04744] MCW rank 8 bound to socket 1[core 0-1]: [. . . . . . . . . . . . . . . .][B B . . . . . . . . . . . . . .]
[1,9]<stderr>:[node33:04744] MCW rank 9 bound to socket 1[core 2-3]: [. . . . . . . . . . . . . . . .][. . B B . . . . . . . . . . . .]
[1,10]<stderr>:[node33:04744] MCW rank 10 bound to socket 1[core 4-5]: [. . . . . . . . . . . . . . . .][. . . . B B . . . . . . . . . .]
[1,11]<stderr>:[node33:04744] MCW rank 11 bound to socket 1[core 6-7]: [. . . . . . . . . . . . . . . .][. . . . . . B B . . . . . . . .]
[1,12]<stderr>:[node33:04744] MCW rank 12 bound to socket 1[core 8-9]: [. . . . . . . . . . . . . . . .][. . . . . . . . B B . . . . . .]
[1,13]<stderr>:[node33:04744] MCW rank 13 bound to socket 1[core 10-11]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . B B . . . .]
[1,14]<stderr>:[node33:04744] MCW rank 14 bound to socket 1[core 12-13]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . B B . .]
[1,15]<stderr>:[node33:04744] MCW rank 15 bound to socket 1[core 14-15]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . B B]
[1,0]<stderr>:[node33:04744] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[1,1]<stderr>:[node33:04744] MCW rank 1 bound to socket 0[core 2-3]: [. . B B . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[1,2]<stderr>:[node33:04744] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[1,3]<stderr>:[node33:04744] MCW rank 3 bound to socket 0[core 6-7]: [. . . . . . B B . . . . . . . .][. . . . . . . . . . . . . . . .]
[1,4]<stderr>:[node33:04744] MCW rank 4 bound to socket 0[core 8-9]: [. . . . . . . . B B . . . . . .][. . . . . . . . . . . . . . . .]
[1,5]<stderr>:[node33:04744] MCW rank 5 bound to socket 0[core 10-11]: [. . . . . . . . . . B B . . . .][. . . . . . . . . . . . . . . .]
[1,6]<stderr>:[node33:04744] MCW rank 6 bound to socket 0[core 12-13]: [. . . . . . . . . . . . B B . .][. . . . . . . . . . . . . . . .]
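
For reference, the layout I expect on each node can be written down
as a short Python sketch. This is only my own bookkeeping of the map
above (2 sockets, 16 cores per socket, 2 cores per process), not
anything Open MPI runs; the function names are made up for the
illustration, and it assumes the cores per socket divide evenly by
the cores per process.

```python
def expected_bindings(sockets=2, cores_per_socket=16, cpus_per_proc=2):
    """Return {rank: (socket, first_core, last_core)} for a single node.

    Ranks fill socket 0 first, then socket 1, each taking
    cpus_per_proc successive cores.
    """
    bindings = {}
    rank = 0
    for socket in range(sockets):
        for first in range(0, cores_per_socket, cpus_per_proc):
            bindings[rank] = (socket, first, first + cpus_per_proc - 1)
            rank += 1
    return bindings


def mask(socket, first, last, sockets=2, cores_per_socket=16):
    """Draw the binding in the same style as --report-bindings."""
    parts = []
    for s in range(sockets):
        cells = [
            "B" if (s == socket and first <= c <= last) else "."
            for c in range(cores_per_socket)
        ]
        parts.append("[" + " ".join(cells) + "]")
    return "".join(parts)


bindings = expected_bindings()
for rank, (socket, first, last) in sorted(bindings.items()):
    print(f"rank {rank} socket {socket}[core {first}-{last}]: "
          f"{mask(socket, first, last)}")
```

The printed masks should match the report above line by line, once
the report is sorted by rank.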

***************

However, when I try to use eight nodes,
the job fails and I get the error message below (repeatedly from
several nodes):

#PBS -l nodes=8:ppn=32
...
mpiexec -np 128 \
     --cpus-per-proc 2 \
     --bind-to-core \
     --report-bindings \
     --tag-output \

Error message:

--------------------------------------------------------------------------
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor on node:

   Node: node18

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M), or that the node has an unexpectedly different topology.

Double check that you have enough unique processors for all the
MPI processes that you are launching on this host, and that all nodes
have identical topologies.

You job will now abort.
--------------------------------------------------------------------------

Oddly enough, the binding map *is* shown on STDERR,
and it looks *correct*, pretty much the same as the binding map above
that I get for a single node.
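
To rule out an arithmetic mistake on my side, here is the count
behind the eight-node request as a small Python sketch. It assumes
the 128 processes are spread evenly over the 8 nodes, which is what
I intend but not necessarily how OMPI 1.6.4 maps them; the function
name and the "fits" flag are just my own labels.

```python
def check_request(nodes, cores_per_node, np, cpus_per_proc):
    """Compare the cores requested with the cores available.

    Assumes an even spread of processes over nodes (my intent, not a
    statement about the actual mapper).
    """
    total_cores = nodes * cores_per_node
    cores_requested = np * cpus_per_proc
    procs_per_node, leftover = divmod(np, nodes)
    fits = leftover == 0 and procs_per_node * cpus_per_proc <= cores_per_node
    return {
        "total_cores": total_cores,
        "cores_requested": cores_requested,
        "procs_per_node": procs_per_node,
        "fits": fits,
    }


single = check_request(nodes=1, cores_per_node=32, np=16, cpus_per_proc=2)
multi = check_request(nodes=8, cores_per_node=32, np=128, cpus_per_proc=2)
print("1 node :", single)
print("8 nodes:", multi)
```

Under that assumption both requests ask for exactly as many cores as
exist (32 of 32, and 256 of 256), with 16 processes per node in each
case, so the "N > M" explanation in the error message does not
obviously apply.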

*****************

Finally, replacing "--cpus-per-proc 2" with "--npernode 16"
works to some extent, but doesn't reach my goal.
I.e., the job doesn't fail, and each node gets 16 MPI
processes indeed.
However, it doesn't bind the processes the way I want.
Regardless of whether I continue to use "--bind-to-core"
or replace it with "--bind-to-socket",
all 16 processes on each node always bind to socket 0,
and nothing goes to socket 1.

************

Is there any simple workaround to this
(other than using a --rankfile),
to make --cpus-per-proc work with multiple nodes,
using "every other core"?

[Only if it is a simple workaround. I must finish this
remote test soon. Otherwise I can revisit this issue later.]
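
[For completeness, the rankfile fallback I would rather avoid could
be generated along these lines. This is an untested sketch: it
assumes the "rank N=host slot=socket:cores" form described in the
mpirun man page, the host names hostA and hostB are placeholders
(the real ones would come from the Torque node file), and the core
numbering within each socket would need to be checked against the
actual nodes.]

```python
def make_rankfile(hosts, sockets=2, cores_per_socket=16, cpus_per_proc=2):
    """Build rankfile text giving each rank a pair of successive cores.

    hosts: list of host names, one entry per node (placeholders here).
    """
    lines = []
    rank = 0
    for host in hosts:
        for socket in range(sockets):
            for first in range(0, cores_per_socket, cpus_per_proc):
                last = first + cpus_per_proc - 1
                lines.append(f"rank {rank}={host} slot={socket}:{first}-{last}")
                rank += 1
    return "\n".join(lines) + "\n"


# Placeholder host names, for illustration only.
rankfile_text = make_rankfile(["hostA", "hostB"])
print(rankfile_text, end="")
```

With 8 real host names this would produce 128 lines, 16 ranks per
node, 8 per socket.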

Thank you,
Gus Correa