Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with mpiexec --cpus-per-proc in multiple nodes in OMPI 1.6.4
From: Gus Correa (gus_at_[hidden])
Date: 2013-03-29 10:53:18


Thank you, Ralph!
Gus Correa

On 03/29/2013 09:33 AM, Ralph Castain wrote:
> Just an update: I have this fixed in the OMPI trunk. It didn't make 1.7.0, but will be in 1.7.1 and beyond.
>
>
> On Mar 21, 2013, at 2:09 PM, Gus Correa <gus_at_[hidden]> wrote:
>
>> Thank you, Ralph.
>>
>> I will try to use a rankfile.
>>
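>> For the record, a rough sketch of what I think such a rankfile could
>> look like (untested; the second hostname and the file name are made up,
>> and I am going by the "rank N=host slot=socket:cores" syntax in the
>> mpirun man page), binding each rank to a pair of successive cores:
>>
>> rank 0=node33 slot=0:0-1
>> rank 1=node33 slot=0:2-3
>> ...
>> rank 8=node33 slot=1:0-1
>> ...
>> rank 16=node34 slot=0:0-1
>> ...
>>
>> and then pass the file to mpiexec with "--rankfile my_rankfile".
>>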
>> In any case, the --cpus-per-proc option is a very useful feature:
>> for hybrid MPI+OpenMP programs, for these processors with one FPU
>> shared by two cores, etc.
>> If it gets fixed in a later release of OMPI, that would be great.
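>>
>> (For the hybrid case, just to illustrate what I mean, the usage I have
>> in mind is something like the sketch below; the executable name and
>> thread count are made up:
>>
>> export OMP_NUM_THREADS=2
>> mpiexec -np 16 --cpus-per-proc 2 --bind-to-core ./my_hybrid_code
>>
>> i.e., one MPI process per core pair, with that process's two OpenMP
>> threads confined to the same pair.)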
>>
>> Thank you,
>> Gus Correa
>>
>>
>> On 03/21/2013 04:03 PM, Ralph Castain wrote:
>>> I've heard this from a couple of other sources -
>>> it looks like there is a problem on the daemons when
>>> they compute the location for -cpus-per-proc.
>>> I'm not entirely sure why that would be as the code
>>> is supposed to be common with mpirun, but there are
>>> a few differences.
>>>
>>> I will take a look at it - I don't know of any workaround,
>>> I'm afraid.
>>>
>>> On Mar 21, 2013, at 12:01 PM, Gus Correa <gus_at_[hidden]> wrote:
>>>
>>>> Dear Open MPI Pros
>>>>
>>>> I am having trouble using mpiexec with --cpus-per-proc
>>>> on multiple nodes in OMPI 1.6.4.
>>>>
>>>> I know there is an ongoing thread about similar runtime issues
>>>> in OMPI 1.7.
>>>> By no means am I trying to hijack T. Mishima's questions.
>>>> My question is genuine, though, and perhaps related to his.
>>>>
>>>> I am testing a new cluster remotely, with monster
>>>> dual socket 16-core AMD Bulldozer processors (32 cores per node).
>>>> I am using OMPI 1.6.4 built with Torque 4.2.1 support.
>>>>
>>>> I read that on these processors each pair of cores shares an FPU.
>>>> Hence, I am trying to run *one MPI process* on each
>>>> *pair of successive cores*.
>>>> This trick seems to yield better performance
>>>> (at least for HPL/Linpack) than using all cores.
>>>> I.e., the goal is to use "every other core", or perhaps
>>>> to allow each process to wobble across two successive cores only,
>>>> thereby granting each process exclusive use of one FPU.
>>>> [BTW, this is *not* an attempt to do hybrid MPI+OpenMP.
>>>> The code is HPL with MPI+BLAS/Lapack and NO OpenMP.]
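>>>>
>>>> [As a sanity check on that, the node topology can be inspected with
>>>> hwloc's lstopo, assuming the hwloc command-line tools are installed
>>>> on the compute nodes:
>>>>
>>>> lstopo
>>>>
>>>> which shows how the cores are grouped (shared caches, etc.) on each
>>>> socket.]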
>>>>
>>>> To achieve this, I am using the mpiexec --cpus-per-proc option.
>>>> It works on one node, which is great.
>>>> However, unless I made a silly syntax or arithmetic mistake,
>>>> it doesn't seem to work on more than one node.
>>>>
>>>> For instance, this works:
>>>>
>>>> #PBS -l nodes=1:ppn=32
>>>> ...
>>>> mpiexec -np 16 \
>>>> --cpus-per-proc 2 \
>>>> --bind-to-core \
>>>> --report-bindings \
>>>> --tag-output \
>>>>
>>>> I get a pretty nice process-to-core distribution, with 16 processes,
>>>> each process bound to a pair of successive cores, as expected:
>>>>
>>>> [1,7]<stderr>:[node33:04744] MCW rank 7 bound to socket 0[core 14-15]: [. . . . . . . . . . . . . . B B][. . . . . . . . . . . . . . . .]
>>>> [1,8]<stderr>:[node33:04744] MCW rank 8 bound to socket 1[core 0-1]: [. . . . . . . . . . . . . . . .][B B . . . . . . . . . . . . . .]
>>>> [1,9]<stderr>:[node33:04744] MCW rank 9 bound to socket 1[core 2-3]: [. . . . . . . . . . . . . . . .][. . B B . . . . . . . . . . . .]
>>>> [1,10]<stderr>:[node33:04744] MCW rank 10 bound to socket 1[core 4-5]: [. . . . . . . . . . . . . . . .][. . . . B B . . . . . . . . . .]
>>>> [1,11]<stderr>:[node33:04744] MCW rank 11 bound to socket 1[core 6-7]: [. . . . . . . . . . . . . . . .][. . . . . . B B . . . . . . . .]
>>>> [1,12]<stderr>:[node33:04744] MCW rank 12 bound to socket 1[core 8-9]: [. . . . . . . . . . . . . . . .][. . . . . . . . B B . . . . . .]
>>>> [1,13]<stderr>:[node33:04744] MCW rank 13 bound to socket 1[core 10-11]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . B B . . . .]
>>>> [1,14]<stderr>:[node33:04744] MCW rank 14 bound to socket 1[core 12-13]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . B B . .]
>>>> [1,15]<stderr>:[node33:04744] MCW rank 15 bound to socket 1[core 14-15]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . B B]
>>>> [1,0]<stderr>:[node33:04744] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>>> [1,1]<stderr>:[node33:04744] MCW rank 1 bound to socket 0[core 2-3]: [. . B B . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>>> [1,2]<stderr>:[node33:04744] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>>> [1,3]<stderr>:[node33:04744] MCW rank 3 bound to socket 0[core 6-7]: [. . . . . . B B . . . . . . . .][. . . . . . . . . . . . . . . .]
>>>> [1,4]<stderr>:[node33:04744] MCW rank 4 bound to socket 0[core 8-9]: [. . . . . . . . B B . . . . . .][. . . . . . . . . . . . . . . .]
>>>> [1,5]<stderr>:[node33:04744] MCW rank 5 bound to socket 0[core 10-11]: [. . . . . . . . . . B B . . . .][. . . . . . . . . . . . . . . .]
>>>> [1,6]<stderr>:[node33:04744] MCW rank 6 bound to socket 0[core 12-13]: [. . . . . . . . . . . . B B . .][. . . . . . . . . . . . . . . .]
>>>>
>>>>
>>>> ***************
>>>>
>>>> However, when I try to use eight nodes,
>>>> the job fails and I get the error message below (repeatedly from
>>>> several nodes):
>>>>
>>>> #PBS -l nodes=8:ppn=32
>>>> ...
>>>> mpiexec -np 128 \
>>>> --cpus-per-proc 2 \
>>>> --bind-to-core \
>>>> --report-bindings \
>>>> --tag-output \
>>>>
>>>>
>>>> Error message:
>>>>
>>>> --------------------------------------------------------------------------
>>>> An invalid physical processor ID was returned when attempting to bind
>>>> an MPI process to a unique processor on node:
>>>>
>>>> Node: node18
>>>>
>>>> This usually means that you requested binding to more processors than
>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>> M), or that the node has an unexpectedly different topology.
>>>>
>>>> Double check that you have enough unique processors for all the
>>>> MPI processes that you are launching on this host, and that all nodes
>>>> have identical topologies.
>>>>
>>>> Your job will now abort.
>>>> --------------------------------------------------------------------------
>>>>
>>>> Oddly enough, the binding map *is* shown on STDERR,
>>>> and it looks *correct*: pretty much the same binding map
>>>> shown above for a single node.
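>>>>
>>>> Since the message suggests the nodes may not have identical
>>>> topologies, one sanity check I can think of (only a sketch; it
>>>> assumes passwordless ssh to the nodes and hwloc installed) is to
>>>> compare a topology fingerprint across the nodes in the job:
>>>>
>>>> for h in $(sort -u $PBS_NODEFILE); do
>>>>     echo -n "$h: "; ssh $h lstopo --of console | md5sum
>>>> done
>>>>
>>>> All nodes should report the same checksum.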
>>>>
>>>> *****************
>>>>
>>>> Finally, replacing "--cpus-per-proc 2" with "--npernode 16"
>>>> works to some extent, but doesn't reach my goal.
>>>> I.e., the job doesn't fail, and each node does get 16 MPI
>>>> processes.
>>>> However, it doesn't bind the processes the way I want.
>>>> Regardless of whether I keep "--bind-to-core"
>>>> or replace it with "--bind-to-socket",
>>>> all 16 processes on each node always bind to socket 0,
>>>> and nothing goes to socket 1.
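>>>>
>>>> [One more thing I may try, though it is only a sketch and I am not
>>>> sure the 1.6.4 mapper honors it: using --npersocket instead of
>>>> --npernode, to spread the 16 processes per node over both sockets,
>>>> e.g.
>>>>
>>>> mpiexec -np 128 --npersocket 8 --bind-to-socket \
>>>> --report-bindings --tag-output \
>>>>
>>>> Even if that works, it still would not give the pairwise core
>>>> binding that --cpus-per-proc 2 provides.]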
>>>>
>>>> ************
>>>>
>>>> Is there any simple workaround to this
>>>> (other than using a --rankfile),
>>>> to make --cpus-per-proc work with multiple nodes,
>>>> using "every other core"?
>>>>
>>>> [Only if it is a simple workaround. I must finish this
>>>> remote test soon; otherwise I can revisit this issue later.]
>>>>
>>>> Thank you,
>>>> Gus Correa
>>>>
>>>>
>>>>
>>>>
>>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users