
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Rankfile problem with Open MPI 1.4.3
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-07-26 17:25:22


I normally hide my eyes when rankfiles appear, but since you provide so much help on this list yourself... :-)

I believe the problem is that you have the keyword "slots" wrong - it is supposed to be "slot":

    rank 1=host1 slot=1:0,1
    rank 0=host2 slot=0:*
    rank 2=host4 slot=1-2
    rank 3=host3 slot=0:1,1:0-2

Hence the flex parser gets confused...

I didn't write this code, but it seems to me that a little more leeway (e.g., accepting "slots" as well as "slot") would be appropriate. If you try the revision and it works, I'll submit a change to accept both syntaxes.
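If you already have rankfiles written with the rejected "slots=" keyword, a quick way to test the corrected form is to rewrite them mechanically. A minimal sketch (the file name my_rankfile is just an example):

```shell
# Create a sample rankfile in the problematic "slots=" form
# (stands in for an existing file of yours).
printf 'rank 0=node34 slots=0\nrank 1=node34 slots=1\n' > my_rankfile

# Rewrite the keyword to the "slot=" form the flex parser accepts;
# the result goes to a new file so the original is untouched.
sed 's/slots=/slot=/' my_rankfile > my_rankfile.fixed
```

You can then pass my_rankfile.fixed to mpiexec via -rf to see whether the parser error goes away.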

On Jul 26, 2011, at 2:49 PM, Gus Correa wrote:

> Dear Open MPI pros
>
> I am having trouble getting the mpiexec rankfile option right.
> I would appreciate any help in solving the problem.
>
> Also, is there a way to tell Open MPI to print out its own numbering
> of the "slots", and perhaps how they map to socket:core pairs?
>
> I am using Open MPI 1.4.3, compiled with Torque 2.4.11 support,
> on Linux CentOS 5.2 x86_64.
> This cluster has nodes with dual AMD Opteron quad-core processors,
> a total of 8 cores per node.
> I enclose a snippet of /proc/cpuinfo below.
>
> I build the rankfile on the fly from the $PBS_NODEFILE.
> The mpiexec command line is:
>
> mpiexec \
> -v \
> -np ${NP} \
> -mca btl openib,sm,self \
> -tag-output \
> -report-bindings \
> -rf $my_rf \
> -mca paffinity_base_verbose 1 \
> connectivity_c -v
>
>
> I tried two different ways to specify the slots on the rankfile:
>
> *First way (sequential "slots" on each node):
>
> rank 0=node34 slots=0
> rank 1=node34 slots=1
> rank 2=node34 slots=2
> rank 3=node34 slots=3
> rank 4=node34 slots=4
> rank 5=node34 slots=5
> rank 6=node34 slots=6
> rank 7=node34 slots=7
> rank 8=node33 slots=0
> rank 9=node33 slots=1
> rank 10=node33 slots=2
> rank 11=node33 slots=3
> rank 12=node33 slots=4
> rank 13=node33 slots=5
> rank 14=node33 slots=6
> rank 15=node33 slots=7
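A rankfile in this sequential style can be built from the Torque nodefile with a short script. A minimal sketch, assuming $PBS_NODEFILE lists each node once per allocated core, so the N-th occurrence of a node becomes slot N-1 on that node (a sample nodefile stands in for the real one here; it uses the "slot=" keyword the parser accepts, and my_rf matches the name in the mpiexec command above):

```shell
# Sample nodefile: one line per allocated core, as Torque writes it.
# Under Torque, use "$PBS_NODEFILE" instead of this file.
printf 'node34\nnode34\nnode33\nnode33\n' > nodefile

# The N-th occurrence of a node name becomes slot N-1 on that node;
# count[] tracks how many times each node has been seen so far.
awk '{ printf "rank %d=%s slot=%d\n", NR-1, $1, count[$1]++ }' nodefile > my_rf
cat my_rf
```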
>
>
> *Second way ( slots in socket:core style) :
>
> rank 0=node34 slots=0:0
> rank 1=node34 slots=0:1
> rank 2=node34 slots=0:2
> rank 3=node34 slots=0:3
> rank 4=node34 slots=1:0
> rank 5=node34 slots=1:1
> rank 6=node34 slots=1:2
> rank 7=node34 slots=1:3
> rank 8=node33 slots=0:0
> rank 9=node33 slots=0:1
> rank 10=node33 slots=0:2
> rank 11=node33 slots=0:3
> rank 12=node33 slots=1:0
> rank 13=node33 slots=1:1
> rank 14=node33 slots=1:2
> rank 15=node33 slots=1:3
>
> ***
>
> I get the error messages below.
> I am scratching my head to full baldness trying to understand them.
>
> They seem to suggest that my rankfile syntax is wrong
> (I copied it from the FAQ and from "man mpiexec"),
> or that it is not being parsed as I expected.
> Or perhaps it doesn't like the numbers I am using for the
> various slots in the rankfile?
> The error messages also complain about
> node allocation or oversubscribed slots,
> but the nodes were allocated by Torque, and the rankfiles were
> written with no intent to oversubscribe.
>
> *First rankfile error:
>
> --------------------------------------------------------------------------
> Rankfile claimed host 0 that was not allocated or oversubscribed it's slots.
> Please review your rank-slot assignments and your host allocation to ensure
> a proper match.
>
> --------------------------------------------------------------------------
>
> ... etc, etc ...
>
> *Second rankfile error:
>
> --------------------------------------------------------------------------
> Rankfile claimed host 0:0 that was not allocated or oversubscribed it's slots.
> Please review your rank-slot assignments and your host allocation to ensure
> a proper match.
>
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> ... etc, etc ...
>
> **********
>
> I am stuck.
> Any help is much appreciated.
> Thank you.
>
> Gus Correa
>
>
>
> *****************************
> Snippet of /proc/cpuinfo
> *****************************
>
> processor : 0
> physical id : 0
> core id : 0
> siblings : 4
> cpu cores : 4
>
> processor : 1
> physical id : 0
> core id : 1
> siblings : 4
> cpu cores : 4
>
> processor : 2
> physical id : 0
> core id : 2
> siblings : 4
> cpu cores : 4
>
> processor : 3
> physical id : 0
> core id : 3
> siblings : 4
> cpu cores : 4
>
> processor : 4
> physical id : 1
> core id : 0
> siblings : 4
> cpu cores : 4
>
> processor : 5
> physical id : 1
> core id : 1
> siblings : 4
> cpu cores : 4
>
> processor : 6
> physical id : 1
> core id : 2
> siblings : 4
> cpu cores : 4
>
> processor : 7
> physical id : 1
> core id : 3
> siblings : 4
> cpu cores : 4
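Regarding the question above about how logical processors map to socket:core pairs: the "physical id" and "core id" fields of /proc/cpuinfo give that mapping directly (this is a view of the kernel's hardware numbering, not an Open MPI facility). A minimal sketch:

```shell
# Print each logical processor with its (socket, core) pair, taken
# from the "physical id" and "core id" fields of /proc/cpuinfo.
# The field separator strips the whitespace around the colon.
awk -F'[ \t]*:[ \t]*' '
    /^processor/   { p = $2 }
    /^physical id/ { s = $2 }
    /^core id/     { printf "cpu %s -> socket %s, core %s\n", p, s, $2 }
' /proc/cpuinfo
```

On the node described above, this should show cpus 0-3 on socket 0 and cpus 4-7 on socket 1.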
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users