Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Rankfile problem with Open MPI 1.4.3
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-07-26 20:47:25


On Jul 26, 2011, at 3:56 PM, Gus Correa wrote:

> Thank you very much, Ralph.
>
> Heck, it had to be something stupid like this.
> Sorry for taking your time.
> Yes, switching from "slots" to "slot" fixes the rankfile problem,
> and both cases work.
>
> I must have been carried along by the hostfile syntax,
> where the "slots" reign, but when it comes to binding,
> obviously for each process rank one wants a single "slot"
> (unless the process is multi-threaded, which is what I need to set up).
>
> I will write 100 times on the blackboard:
> "Slots in the hostfile, slot in the rankfile,
> slot is singular, to err is plural."

LOL

> ... at least until Ralph's new plural-forgiving parsing rule
> makes it to the code.

Committed to the trunk, in the queue for both 1.4.4 and 1.5.4.
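
With that fix, either spelling should be accepted - e.g., both of these
lines ought to parse identically (an illustration against the patched
parser, not actual tested output):

rank 0=node34 slot=0
rank 0=node34 slots=0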

>
> Regards,
> Gus Correa
>
> Ralph Castain wrote:
>> I normally hide my eyes when rankfiles appear,
>> but since you provide so much help on this list yourself... :-)
>> I believe the problem is that you have the keyword "slots" wrong -
>> it is supposed to be "slot":
>> rank 1=host1 slot=1:0,1
>> rank 0=host2 slot=0:*
>> rank 2=host4 slot=1-2
>> rank 3=host3 slot=0:1,1:0-2
>> Hence the flex parser gets confused...
>> I didn't write this code, but it seems to me that a little more leeway (e.g., allowing "slots" as well as "slot") would be more appropriate. If you try the revision and it works, I'll submit a change to accept both syntax options.
>> On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:
>>> Dear Open MPI pros
>>>
>>> I am having trouble getting the mpiexec rankfile option right.
>>> I would appreciate any help to solve the problem.
>>>
>>> Also, is there a way to tell Open MPI to print out its own numbering
>>> of the "slots", and perhaps how they're mapped to socket:core pairs?
>>>
>>> I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
>>> on Linux CentOS 5.2 x86_64.
>>> This cluster has nodes with dual AMD Opteron quad-core processors,
>>> for a total of 8 cores per node.
>>> I enclose a snippet of /proc/cpuinfo below.
>>>
>>> I build the rankfile on the fly from the $PBS_NODEFILE
>>> (a sketch of that loop follows the two rankfile listings below).
>>> The mpiexec command line is:
>>>
>>> mpiexec \
>>> -v \
>>> -np ${NP} \
>>> -mca btl openib,sm,self \
>>> -tag-output \
>>> -report-bindings \
>>> -rf $my_rf \
>>> -mca paffinity_base_verbose 1 \
>>> connectivity_c -v
>>>
>>>
>>> I tried two different ways to specify the slots on the rankfile:
>>>
>>> *First way (sequential "slots" on each node):
>>>
>>> rank 0=node34 slots=0
>>> rank 1=node34 slots=1
>>> rank 2=node34 slots=2
>>> rank 3=node34 slots=3
>>> rank 4=node34 slots=4
>>> rank 5=node34 slots=5
>>> rank 6=node34 slots=6
>>> rank 7=node34 slots=7
>>> rank 8=node33 slots=0
>>> rank 9=node33 slots=1
>>> rank 10=node33 slots=2
>>> rank 11=node33 slots=3
>>> rank 12=node33 slots=4
>>> rank 13=node33 slots=5
>>> rank 14=node33 slots=6
>>> rank 15=node33 slots=7
>>>
>>>
>>> *Second way (slots in socket:core style):
>>>
>>> rank 0=node34 slots=0:0
>>> rank 1=node34 slots=0:1
>>> rank 2=node34 slots=0:2
>>> rank 3=node34 slots=0:3
>>> rank 4=node34 slots=1:0
>>> rank 5=node34 slots=1:1
>>> rank 6=node34 slots=1:2
>>> rank 7=node34 slots=1:3
>>> rank 8=node33 slots=0:0
>>> rank 9=node33 slots=0:1
>>> rank 10=node33 slots=0:2
>>> rank 11=node33 slots=0:3
>>> rank 12=node33 slots=1:0
>>> rank 13=node33 slots=1:1
>>> rank 14=node33 slots=1:2
>>> rank 15=node33 slots=1:3
>>>
>>> ***
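>>>
>>> (For reference, the first rankfile above is built by a shell loop
>>> along these lines - a simplified sketch rather than the exact script;
>>> it assumes $PBS_NODEFILE lists one line per allocated core, grouped
>>> by node, as Torque writes it:)
>>>
>>> rank=0; slot=0; prev=""
>>> while read node; do
>>>   [ "$node" != "$prev" ] && slot=0   # new node: restart the slot count
>>>   echo "rank $rank=$node slots=$slot" >> $my_rf
>>>   rank=$((rank+1)); slot=$((slot+1)); prev=$node
>>> done < $PBS_NODEFILE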
>>>
>>> I get the error messages below.
>>> I am scratching my head to full baldness trying to understand them.
>>>
>>> They seem to suggest that my rankfile syntax is wrong
>>> (I copied it from the FAQ and the mpiexec man page),
>>> or that the rankfile is not being parsed as I expected.
>>> Or is it perhaps that it doesn't like the numbers I am using for the
>>> various slots in the rankfile?
>>> The error messages also complain about
>>> node allocation or oversubscribed slots,
>>> but the nodes were allocated by Torque, and the rankfiles were
>>> written with no intent to oversubscribe.
>>>
>>> *First rankfile error:
>>>
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
>>> Please review your rank-slot assignments and your host allocation to ensure
>>> a proper match.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> ... etc, etc ...
>>>
>>> *Second rankfile error:
>>>
>>> --------------------------------------------------------------------------
>>> Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
>>> Please review your rank-slot assignments and your host allocation to ensure
>>> a proper match.
>>>
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> launch so we are aborting.
>>>
>>> ... etc, etc ...
>>>
>>> **********
>>>
>>> I am stuck.
>>> Any help is much appreciated.
>>> Thank you.
>>>
>>> Gus Correa
>>>
>>>
>>>
>>> *****************************
>>> Snippet of /proc/cpuinfo
>>> *****************************
>>>
>>> processor : 0
>>> physical id : 0
>>> core id : 0
>>> siblings : 4
>>> cpu cores : 4
>>>
>>> processor : 1
>>> physical id : 0
>>> core id : 1
>>> siblings : 4
>>> cpu cores : 4
>>>
>>> processor : 2
>>> physical id : 0
>>> core id : 2
>>> siblings : 4
>>> cpu cores : 4
>>>
>>> processor : 3
>>> physical id : 0
>>> core id : 3
>>> siblings : 4
>>> cpu cores : 4
>>>
>>> processor : 4
>>> physical id : 1
>>> core id : 0
>>> siblings : 4
>>> cpu cores : 4
>>>
>>> processor : 5
>>> physical id : 1
>>> core id : 1
>>> siblings : 4
>>> cpu cores : 4
>>>
>>> processor : 6
>>> physical id : 1
>>> core id : 2
>>> siblings : 4
>>> cpu cores : 4
>>>
>>> processor : 7
>>> physical id : 1
>>> core id : 3
>>> siblings : 4
>>> cpu cores : 4
>>>
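>>> Reading "physical id" as the socket number and "core id" as the
>>> core number (which is how I understand the rankfile socket:core
>>> pairs), the mapping on these nodes would be:
>>>
>>> processors 0-3 -> socket 0, cores 0-3
>>> processors 4-7 -> socket 1, cores 0-3
>>>
>>> so, e.g., rank 4's "slots=1:0" should land on processor 4.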