Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2013-11-25 15:28:23


Just talked with our local Cray rep. It sounds like that torque syntax is broken. You can continue
to use qsub (though qsub use is strongly discouraged) if you pass it the options you would give msub.

Ex:

qsub -lnodes=2:ppn=16

Works.
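
As a rough sanity check on a request like this, the total number of ranks the allocation should support is just nodes times ppn. The helper below is only an illustrative sketch; total_slots is a hypothetical name, not a torque or Moab command, and it assumes the resource string has the plain "nodes=N:ppn=M" form used here.

```shell
# Hypothetical helper: compute the expected slot count from a
# "nodes=N:ppn=M" resource string (the form passed after -l).
total_slots() {
    spec=$1
    nodes=${spec#nodes=}     # drop the leading "nodes="
    nodes=${nodes%%:*}       # keep only the node count
    ppn=${spec##*ppn=}       # drop everything up to "ppn="
    ppn=${ppn%%:*}           # ignore any trailing ":option" parts
    echo $((nodes * ppn))
}

total_slots "nodes=2:ppn=16"   # prints 32
```

For nodes=2:ppn=16 that gives 32, which is the rank count a full-width launch such as "aprun -n 32" would need.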

-Nathan

On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> Hmm, this seems like either a bug in qsub (torque is full of serious bugs) or a bug
> in alps. I got an allocation using that command and alps only sees 1 node:
>
> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
> [ct-login1.localdomain:06010] ras:alps:allocate: success
> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
>
> ====================== ALLOCATED NODES ======================
>
> Data for node: 29 Num slots: 16 Max slots: 0
>
> =================================================================
>
>
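
One way to check what ALPS handed back, assuming the verbose output has been saved to a file, is to tally the "processing NID ... with ... slots" lines. This is only a sketch: summarize_alloc is an illustrative helper and ras.log a hypothetical file name, neither is part of Open MPI or ALPS.

```shell
# Hypothetical helper: count nodes and slots reported in saved ras:alps output.
# It looks for lines of the form "... processing NID <id> with <n> slots".
summarize_alloc() {
    awk '/processing NID/ {
             for (i = 1; i <= NF; i++)
                 if ($i == "NID") { nodes++; slots += $(i + 3) }
         }
         END { printf "nodes=%d slots=%d\n", nodes, slots }' "$1"
}

# Usage (ras.log is an assumed file name):
#   summarize_alloc ras.log
```

For the output above this reports nodes=1 slots=16, where a 2-node, 16-ppn request should give 2 nodes and 32 slots.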
> Torque also shows only one node with 16 PPN:
>
> $ env | grep PBS
> ...
> PBS_NUM_PPN=16
>
>
> $ cat /var/spool/torque/aux//915289.sdb
> login1
>
> Which is wrong! I will have to ask Cray what is going on here. I recommend you switch to
> msub to get an allocation. Moab has fewer bugs. I can't even get aprun to work:
>
> $ aprun -n 2 -N 1 hostname
> apsched: claim exceeds reservation's node-count
>
> $ aprun -n 32 hostname
> apsched: claim exceeds reservation's node-count
>
>
> To get an interactive session on 2 nodes with 16 ppn each, run:
>
> msub -I -lnodes=2:ppn=16
>
> Open MPI should then work correctly.
>
> -Nathan Hjelm
> HPC-5, LANL
>
> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> > Hi,
> > I installed OpenMPI on our small XE6 using the configure options under
> > the /contrib directory. It appears to be working fine, but it ignores MCA
> > parameters (set in env vars). So I switched to mpirun (in OpenMPI), and it
> > can handle MCA parameters somehow. However, mpirun fails to allocate
> > processes by core. For example, when I allocated 32 cores (on 2 nodes) with "qsub
> > -lmppwidth=32 -lmppnppn=16", mpirun recognized it as 2 slots. Is it
> > possible for mpirun to handle the multicore nodes of the XE6 properly, or are there
> > any options to handle MCA parameters for aprun?
> > Regards,
> > -----------------------------------------------------------------------------
> > Keita Teranishi
> > Principal Member of Technical Staff
> > Scalable Modeling and Analysis Systems
> > Sandia National Laboratories
> > Livermore, CA 94551
> > +1 (925) 294-3738
>
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>



