Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
From: Reuti (reuti_at_[hidden])
Date: 2013-02-28 11:40:19


On 28.02.2013 at 17:29, Ralph Castain wrote:

>
> On Feb 28, 2013, at 6:17 AM, Reuti <reuti_at_[hidden]> wrote:
>
>> On 28.02.2013 at 08:58, Reuti wrote:
>>
>>> On 28.02.2013 at 06:55, Ralph Castain wrote:
>>>
>>>> I don't off-hand see a problem, though I do note that your "working" version incorrectly reports the universe size as 2!
>>>
>>> Yes, it was 2 in the working case, where I gave only the two hostnames without any slot count. What should it be in this case: "unknown", "infinity"?
>>
>> As an addendum:
>>
>> a) I tried it again on the command line and still get:
>>
>> Total: 64
>> Universe: 2
>>
>> with a hostfile
>>
>> node006
>> node007
>>
>
> My bad - since no slots were given, we default to a value of 1 for each node, so this is correct.
>
>>
>> b) In a job script under SGE, with Open MPI compiled --with-sge, I get the following after mangling the hostfile:
>>
>> #!/bin/sh
>> #$ -pe openmpi* 128
>> #$ -l exclusive
>> # Keep only the hostname column of $PE_HOSTFILE (drops the slot counts).
>> cut -f 1 -d" " "$PE_HOSTFILE" > "$TMPDIR/machines"
>> mpiexec -cpus-per-proc 2 -report-bindings -hostfile "$TMPDIR/machines" -np 64 ./mpihello
>>
>> Here:
>>
>> Total: 64
>> Universe: 128
>
> This would be correct as SGE is allocating a total of 128 slots (or pe's)

Yep, this is the case. But the hostfile I give in addition contains only the two hostnames (no slot count).
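
For illustration, the two hostfile forms discussed in this thread can be derived from the SGE file roughly as follows. This is a minimal sketch: the function names are hypothetical, the sample data is invented, and the four-column layout of $PE_HOSTFILE (host, slots, queue, processor range) is an assumption.

```sh
#!/bin/sh
# Hypothetical helpers; the four-column PE_HOSTFILE layout used below
# (host, slots, queue, processor range) is an assumption.

# Keep only the hostname column, as the cut line in the job script does.
hosts_only() {
    cut -f 1 -d" " "$1"
}

# Keep hostname and slot count, written in Open MPI hostfile syntax.
hosts_with_slots() {
    awk '{ print $1 " slots=" $2 }' "$1"
}

# Sample data standing in for $PE_HOSTFILE (invented for illustration).
sample=$(mktemp)
cat > "$sample" <<'EOF'
node006 64 all.q@node006 UNDEFINED
node007 64 all.q@node007 UNDEFINED
EOF

hosts_only "$sample"        # prints node006 and node007, one per line
hosts_with_slots "$sample"  # prints "node006 slots=64" and "node007 slots=64"
rm -f "$sample"
```

The first form corresponds to the mangled file used in the job script, the second to the hostfile with explicit slot counts from the original report.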

And if I don't supply this mangled file in addition, it won't start up but gives the error:

--------------------------------------------------------------------------
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor.

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M). Double check that you have enough unique processors for all the
MPI processes that you are launching on this host.

You job will now abort.
--------------------------------------------------------------------------

What I just noticed: this error gives no hostname when running inside SGE, but it does give one when started from the command line:

--------------------------------------------------------------------------
An invalid physical processor ID was returned when attempting to bind
an MPI process to a unique processor on node:

  Node: node006

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors, where N >
M), or that the node has an unexpectedly different topology.

Double check that you have enough unique processors for all the
MPI processes that you are launching on this host, and that all nodes
have identical topologies.

You job will now abort.
--------------------------------------------------------------------------
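
One way to compare the cases in this thread is to count how many cores the first node would need. The sketch below assumes that processes are placed slot by slot, filling a node up to its slot count before moving on and wrapping around once all nodes are full. That placement model is an assumption made for illustration, not something confirmed in this thread, and the function names are hypothetical.

```sh
#!/bin/sh
# Hypothetical model, NOT Open MPI's actual mapper: processes are placed
# slot by slot, each node taking up to its slot count per pass, wrapping
# around until all processes are placed.
# Arguments: number of processes, slots per node, number of nodes.
procs_on_first_node() {
    np=$1; slots=$2; nodes=$3
    first=0
    while [ "$np" -gt 0 ]; do
        node=1
        while [ "$node" -le "$nodes" ] && [ "$np" -gt 0 ]; do
            take=$slots
            if [ "$np" -lt "$take" ]; then take=$np; fi
            if [ "$node" -eq 1 ]; then first=$((first + take)); fi
            np=$((np - take))
            node=$((node + 1))
        done
    done
    echo "$first"
}

# Cores the first node needs; the fourth argument is the -cpus-per-proc value.
cores_needed() {
    echo $(( $(procs_on_first_node "$1" "$2" "$3") * $4 ))
}

cores_needed 64 1 2 2    # no slots given (default 1 per node): prints 64
cores_needed 64 64 2 2   # slots=64, -np 64: prints 128
cores_needed 32 64 2 2   # slots=64, -np 32: prints 64
```

Under that assumption, only the slots=64 case with -np 64 asks for more than the 64 integer cores a node has, which would be consistent with the failure reported in the thread.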

-- Reuti

>>
>> Maybe the allocation found by SGE and the one from the command-line argument are getting mixed up here.
>>
>> -- Reuti
>>
>>
>>> -- Reuti
>>>
>>>
>>>>
>>>> I'll have to take a look at this and get back to you on it.
>>>>
>>>> On Feb 27, 2013, at 3:15 PM, Reuti <reuti_at_[hidden]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine, but then I tried it outside any queuing system and saw exactly the same behavior.
>>>>>
>>>>> @) Each machine has 4 CPUs, each with 16 integer cores, hence 64 integer cores per machine in total. The Open MPI version used is 1.6.4.
>>>>>
>>>>>
>>>>> a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>>
>>>>> and a hostfile containing only the two lines listing the machines:
>>>>>
>>>>> node006
>>>>> node007
>>>>>
>>>>> This works the way I want (see working.txt) when started on node006.
>>>>>
>>>>>
>>>>> b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
>>>>>
>>>>> But this time with the hostfile changed to include a slot count, which might mimic a machinefile parsed from the output of a queuing system:
>>>>>
>>>>> node006 slots=64
>>>>> node007 slots=64
>>>>>
>>>>> This fails with:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> An invalid physical processor ID was returned when attempting to bind
>>>>> an MPI process to a unique processor on node:
>>>>>
>>>>> Node: node006
>>>>>
>>>>> This usually means that you requested binding to more processors than
>>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>>> M), or that the node has an unexpectedly different topology.
>>>>>
>>>>> Double check that you have enough unique processors for all the
>>>>> MPI processes that you are launching on this host, and that all nodes
>>>>> have identical topologies.
>>>>>
>>>>> You job will now abort.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> (see failed.txt)
>>>>>
>>>>>
>>>>> b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello
>>>>>
>>>>> This works, and the universe size found is 128 as expected (see only32.txt).
>>>>>
>>>>>
>>>>> c) Maybe the machinefile is not parsed correctly, so I checked:
>>>>>
>>>>> c1) mpiexec -hostfile machines -np 64 ./mpihello => works
>>>>>
>>>>> c2) mpiexec -hostfile machines -np 128 ./mpihello => works
>>>>>
>>>>> c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected
>>>>>
>>>>> So it got the slot counts right.
>>>>>
>>>>> What am I missing?
>>>>>
>>>>> -- Reuti
>>>>>
>>>>> <failed.txt><only32.txt><working.txt>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users