
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Rankfile problem with Open MPI 1.4.3
From: Gus Correa (gus_at_[hidden])
Date: 2011-07-26 17:56:50


Thank you very much, Ralph.

Heck, it had to be something stupid like this.
Sorry for taking up your time.
Yes, switching from "slots" to "slot" fixes the rankfile problem,
and both cases work.

I must have been carried along by the hostfile syntax,
where "slots" reigns, but when it comes to binding,
obviously each process rank wants a single "slot"
(unless the process is multi-threaded, which is what I need to set up).

I will write 100 times on the blackboard:
"Slots in the hostfile, slot in the rankfile,
slot is singular, to err is plural."
... at least until Ralph's new plural-forgiving parsing rule
makes it to the code.
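
For the record, here is the first rankfile from my post below
with the fix applied (ranks 2-14 follow the same pattern):

rank 0=node34 slot=0
rank 1=node34 slot=1
...
rank 15=node33 slot=7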

Regards,
Gus Correa

Ralph Castain wrote:
> I normally hide my eyes when rankfiles appear,
> but since you provide so much help on this list yourself... :-)
>
> I believe the problem is that you have the keyword "slots" wrong -
> it is supposed to be "slot":
>
> rank 1=host1 slot=1:0,1
> rank 0=host2 slot=0:*
> rank 2=host4 slot=1-2
> rank 3=host3 slot=0:1,1:0-2
>
> Hence the flex parser gets confused (note that the errors below
> complain about "host 0" and "host 0:0" - the unrecognized keyword
> apparently makes the parser misread the slot value as a hostname)...
>
> I didn't write this code, but
> it seems to me that a little more leeway
> (e.g., allowing "slots" as well as "slot")
> would be more appropriate. If you try the revision and it works,
> I'll submit a change to accept both syntax options.
>
> On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:
>
>> Dear Open MPI pros
>>
>> I am having trouble getting the mpiexec rankfile option right.
>> I would appreciate any help to solve the problem.
>>
>> Also, is there a way to tell Open MPI to print out its own numbering
>> of the "slots", and perhaps how they map to socket:core pairs?
>>
>> I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
>> on Linux CentOS 5.2 x86_64.
>> This cluster has nodes with dual AMD Opteron quad-core processors,
>> a total of 8 cores per node.
>> I enclose a snippet of /proc/cpuinfo below.
>>
>> I build the rankfile on the fly from the $PBS_NODEFILE.
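>>
>> (For illustration, a minimal sketch of one way to build such a
>> rankfile; this is hypothetical, since the actual script is not
>> shown here, and it assumes Torque lists each node in $PBS_NODEFILE
>> once per allocated core, so an 8-core node appears 8 times:)
>>
>> #!/bin/bash
>> # Assign sequential slot numbers 0-7 on each 8-core node,
>> # as in the first rankfile below.
>> # NB: the keyword must be "slot"; the rankfiles below mistakenly
>> # use "slots", which is what triggers the errors.
>> my_rf=rankfile.$PBS_JOBID
>> : > $my_rf                            # start with an empty rankfile
>> rank=0; slot=0; prev=""
>> while read node; do
>>   [ "$node" != "$prev" ] && slot=0    # new node: restart slot count
>>   echo "rank $rank=$node slot=$slot" >> $my_rf
>>   rank=$((rank + 1)); slot=$((slot + 1)); prev=$node
>> done < $PBS_NODEFILE
>>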
>> The mpiexec command line is:
>>
>> mpiexec \
>> -v \
>> -np ${NP} \
>> -mca btl openib,sm,self \
>> -tag-output \
>> -report-bindings \
>> -rf $my_rf \
>> -mca paffinity_base_verbose 1 \
>> connectivity_c -v
>>
>>
>> I tried two different ways to specify the slots on the rankfile:
>>
>> *First way (sequential "slots" on each node):
>>
>> rank 0=node34 slots=0
>> rank 1=node34 slots=1
>> rank 2=node34 slots=2
>> rank 3=node34 slots=3
>> rank 4=node34 slots=4
>> rank 5=node34 slots=5
>> rank 6=node34 slots=6
>> rank 7=node34 slots=7
>> rank 8=node33 slots=0
>> rank 9=node33 slots=1
>> rank 10=node33 slots=2
>> rank 11=node33 slots=3
>> rank 12=node33 slots=4
>> rank 13=node33 slots=5
>> rank 14=node33 slots=6
>> rank 15=node33 slots=7
>>
>>
>> *Second way (slots in socket:core style):
>>
>> rank 0=node34 slots=0:0
>> rank 1=node34 slots=0:1
>> rank 2=node34 slots=0:2
>> rank 3=node34 slots=0:3
>> rank 4=node34 slots=1:0
>> rank 5=node34 slots=1:1
>> rank 6=node34 slots=1:2
>> rank 7=node34 slots=1:3
>> rank 8=node33 slots=0:0
>> rank 9=node33 slots=0:1
>> rank 10=node33 slots=0:2
>> rank 11=node33 slots=0:3
>> rank 12=node33 slots=1:0
>> rank 13=node33 slots=1:1
>> rank 14=node33 slots=1:2
>> rank 15=node33 slots=1:3
>>
>> ***
>>
>> I get the error messages below.
>> I am scratching my head to full baldness trying to understand them.
>>
>> They seem to suggest that my rankfile syntax is wrong
>> (though I copied it from the FAQ and the mpiexec man page),
>> or that Open MPI is not parsing it as I expected.
>> Or is it perhaps that it doesn't like the numbers I am using for the
>> various slots in the rankfile?
>> The error messages also complain about
>> node allocation or oversubscribed slots,
>> but the nodes were allocated by Torque, and the rankfiles were
>> written with no intent to oversubscribe.
>>
>> *First rankfile error:
>>
>> --------------------------------------------------------------------------
>> Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
>> Please review your rank-slot assignments and your host allocation to ensure
>> a proper match.
>>
>> --------------------------------------------------------------------------
>>
>> ... etc, etc ...
>>
>> *Second rankfile error:
>>
>> --------------------------------------------------------------------------
>> Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
>> Please review your rank-slot assignments and your host allocation to ensure
>> a proper match.
>>
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> ... etc, etc ...
>>
>> **********
>>
>> I am stuck.
>> Any help is much appreciated.
>> Thank you.
>>
>> Gus Correa
>>
>>
>>
>> *****************************
>> Snippet of /proc/cpuinfo
>> (physical id 0 = processors 0-3; physical id 1 = processors 4-7,
>> i.e. two sockets with four cores each)
>> *****************************
>>
>> processor : 0
>> physical id : 0
>> core id : 0
>> siblings : 4
>> cpu cores : 4
>>
>> processor : 1
>> physical id : 0
>> core id : 1
>> siblings : 4
>> cpu cores : 4
>>
>> processor : 2
>> physical id : 0
>> core id : 2
>> siblings : 4
>> cpu cores : 4
>>
>> processor : 3
>> physical id : 0
>> core id : 3
>> siblings : 4
>> cpu cores : 4
>>
>> processor : 4
>> physical id : 1
>> core id : 0
>> siblings : 4
>> cpu cores : 4
>>
>> processor : 5
>> physical id : 1
>> core id : 1
>> siblings : 4
>> cpu cores : 4
>>
>> processor : 6
>> physical id : 1
>> core id : 2
>> siblings : 4
>> cpu cores : 4
>>
>> processor : 7
>> physical id : 1
>> core id : 3
>> siblings : 4
>> cpu cores : 4
>>