Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with mpiexec --cpus-per-proc in multiple nodes in OMPI 1.6.4
From: Gus Correa (gus_at_[hidden])
Date: 2013-03-21 15:58:19


On 03/21/2013 03:12 PM, Reuti wrote:
> Am 21.03.2013 um 20:01 schrieb Gus Correa:
>
>> Dear Open MPI Pros
>>
>> I am having trouble using mpiexec with --cpus-per-proc
>> on multiple nodes in OMPI 1.6.4.
>>
>> I know there is an ongoing thread on similar runtime issues
>> of OMPI 1.7.
>> By no means am I trying to hijack T. Mishima's questions.
>> My question is genuine, though, and perhaps related to his.
>>
>> I am testing a new cluster remotely, with monster
>> dual socket 16-core AMD Bulldozer processors (32 cores per node).
>> I am using OMPI 1.6.4 built with Torque 4.2.1 support.
>>
>> I read that on these processors each pair of cores shares an FPU.
>> Hence, I am trying to run *one MPI process* on each
>> *pair of successive cores*.
>> This trick seems to yield better performance
>> (at least for HPL/Linpack) than using all cores.
>> I.e., the goal is to use "every other core", or perhaps
>> to allow each process to wobble across two successive cores only,
>> hence granting exclusive use of one FPU per process.
>> [BTW, this is *not* an attempt to do hybrid MPI+OpenMP.
>> The code is HPL with MPI+BLAS/Lapack and NO OpenMP.]
>>
>> To achieve this, I am using the mpiexec --cpus-per-proc option.
>> It works on one node, which is great.
>> However, unless I made a silly syntax or arithmetic mistake,
>> it doesn't seem to work on more than one node.
>>
>> For instance, this works:
>>
>> #PBS -l nodes=1:ppn=32
>> ...
>> mpiexec -np 16 \
>> --cpus-per-proc 2 \
>> --bind-to-core \
>> --report-bindings \
>> --tag-output \
>>
>> I get a pretty nice process-to-cores distribution, with 16 processes,
>> and each process bound to a pair of successive cores, as expected:
>>
>> [1,7]<stderr>:[node33:04744] MCW rank 7 bound to socket 0[core 14-15]: [. . . . . . . . . . . . . . B B][. . . . . . . . . . . . . . . .]
>> [1,8]<stderr>:[node33:04744] MCW rank 8 bound to socket 1[core 0-1]: [. . . . . . . . . . . . . . . .][B B . . . . . . . . . . . . . .]
>> [1,9]<stderr>:[node33:04744] MCW rank 9 bound to socket 1[core 2-3]: [. . . . . . . . . . . . . . . .][. . B B . . . . . . . . . . . .]
>> [1,10]<stderr>:[node33:04744] MCW rank 10 bound to socket 1[core 4-5]: [. . . . . . . . . . . . . . . .][. . . . B B . . . . . . . . . .]
>> [1,11]<stderr>:[node33:04744] MCW rank 11 bound to socket 1[core 6-7]: [. . . . . . . . . . . . . . . .][. . . . . . B B . . . . . . . .]
>> [1,12]<stderr>:[node33:04744] MCW rank 12 bound to socket 1[core 8-9]: [. . . . . . . . . . . . . . . .][. . . . . . . . B B . . . . . .]
>> [1,13]<stderr>:[node33:04744] MCW rank 13 bound to socket 1[core 10-11]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . B B . . . .]
>> [1,14]<stderr>:[node33:04744] MCW rank 14 bound to socket 1[core 12-13]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . B B . .]
>> [1,15]<stderr>:[node33:04744] MCW rank 15 bound to socket 1[core 14-15]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . B B]
>> [1,0]<stderr>:[node33:04744] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>> [1,1]<stderr>:[node33:04744] MCW rank 1 bound to socket 0[core 2-3]: [. . B B . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>> [1,2]<stderr>:[node33:04744] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>> [1,3]<stderr>:[node33:04744] MCW rank 3 bound to socket 0[core 6-7]: [. . . . . . B B . . . . . . . .][. . . . . . . . . . . . . . . .]
>> [1,4]<stderr>:[node33:04744] MCW rank 4 bound to socket 0[core 8-9]: [. . . . . . . . B B . . . . . .][. . . . . . . . . . . . . . . .]
>> [1,5]<stderr>:[node33:04744] MCW rank 5 bound to socket 0[core 10-11]: [. . . . . . . . . . B B . . . .][. . . . . . . . . . . . . . . .]
>> [1,6]<stderr>:[node33:04744] MCW rank 6 bound to socket 0[core 12-13]: [. . . . . . . . . . . . B B . .][. . . . . . . . . . . . . . . .]
>>
>>
>> ***************
>>
>> However, when I try to use eight nodes,
>> the job fails and I get the error message below (repeatedly from
>> several nodes):
>>
>> #PBS -l nodes=8:ppn=32
>> ...
>> mpiexec -np 128 \
>> --cpus-per-proc 2 \
>> --bind-to-core \
>> --report-bindings \
>> --tag-output \
>>
>>
>> Error message:
>>
>> --------------------------------------------------------------------------
>> An invalid physical processor ID was returned when attempting to bind
>> an MPI process to a unique processor on node:
>>
>> Node: node18
>>
>> This usually means that you requested binding to more processors than
>> exist (e.g., trying to bind N MPI processes to M processors, where
>> N > M), or that the node has an unexpectedly different topology.
>>
>> Double check that you have enough unique processors for all the
>> MPI processes that you are launching on this host, and that all nodes
>> have identical topologies.
>>
>> You job will now abort.
>> --------------------------------------------------------------------------
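The error text suggests oversubscription (N > M) as one possible cause, so it may help to check the arithmetic of the eight-node request. The sketch below only reuses the numbers given in the post (8 nodes, 32 cores per node, 2 cores per process). The variable names are illustrative and not part of Open MPI or Torque.

```shell
# Values taken from the post; variable names are illustrative only.
nodes=8
cores_per_node=32
cpus_per_proc=2

# Each process takes cpus_per_proc cores, so a node holds this many processes.
procs_per_node=$((cores_per_node / cpus_per_proc))

# Total processes that fit across the whole allocation.
total_procs=$((nodes * procs_per_node))

echo "processes per node: $procs_per_node"
echo "total processes (-np): $total_procs"
```

By this count the request for 128 processes matches the number of core pairs available, so a simple N > M miscount does not appear to explain the failure.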
>
> I got this when I ran it on the command line and specified a hostfile myself. The weird thing was that it worked fine when the job was submitted through SGE. In that case the allocation was correct, as if the hostfile were honored only when Open MPI assembled it on its own from the list of machines granted by SGE.
>
> Was Open MPI built with tm-support? I would expect it to work in the same way for Torque.
>
> -- Reuti
>

Thank you, Reuti.
Yes, OMPI 1.6.4 was built with --with-tm (Torque 4.2.1).
I haven't tried a rankfile yet.
The problem happens with --cpus-per-proc and only when
more than one node is requested.
Gus Correa
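
For reference, the rankfile route mentioned above could be sketched as follows. This is only a minimal sketch, not something tested in this thread. The function name make_rankfile and the file name my_rankfile are hypothetical, the topology (2 sockets of 16 cores) is taken from the post, and the `rank N=host slot=socket:cores` entry format is assumed from the mpirun man page. Whether a rankfile avoids the multi-node error on this cluster is unknown.

```shell
#!/bin/bash
# Hypothetical helper: print a rankfile that gives each rank a pair of
# successive cores. Arguments: sockets per node, cores per socket, hostnames.
make_rankfile() {
    local sockets=$1 cores=$2
    shift 2
    local rank=0 host s c
    for host in "$@"; do
        for ((s = 0; s < sockets; s++)); do
            for ((c = 0; c < cores; c += 2)); do
                # Assumed entry format: rank N=host slot=socket:first-last
                echo "rank $rank=$host slot=$s:$c-$((c + 1))"
                rank=$((rank + 1))
            done
        done
    done
}

# Possible use inside a Torque job (untested on the cluster in question):
#   make_rankfile 2 16 $(sort -u "$PBS_NODEFILE") > my_rankfile
#   mpiexec -np 128 --rankfile my_rankfile --report-bindings ...
```

For a single host this reproduces the layout reported in the single-node run, e.g. rank 8 on socket 1, cores 0-1.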

>
>> Oddly enough, the binding map *is* shown on STDERR,
>> and it looks *correct*, pretty much the same as the binding map above
>> that I get for a single node.
>>
>> *****************
>>
>> Finally, replacing "--cpus-per-proc 2" by "--npernode 16"
>> works to some extent, but doesn't reach my goal.
>> I.e., the job doesn't fail, and each node gets 16 MPI
>> processes indeed.
>> However, it doesn't bind the processes the way I want.
>> Regardless of whether I continue to use "--bind-to-core"
>> or replace it by "--bind-to-socket"
>> all 16 processes on each node always bind to socket 0,
>> and nothing goes to socket 1.
>>
>> ************
>>
>> Is there any simple workaround to this
>> (other than using a --rankfile),
>> to make --cpus-per-proc work with multiple nodes,
>> using "every other core"?
>>
>> [Only if it is a simple workaround. I must finish this
>> remote test soon. Otherwise I can revisit this issue later.]
>>
>> Thank you,
>> Gus Correa
>>
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users