
Subject: Re: [OMPI users] (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2013-11-25 15:11:29


Hmm, this seems like either a bug in qsub (Torque is full of serious bugs) or a bug
in ALPS. I got an allocation using that same qsub command (-lmppwidth=32 -lmppnppn=16)
and ALPS only sees 1 node:

[ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
[ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
[ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
[ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
[ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
[ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
[ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
[ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
[ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
[ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
[ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
[ct-login1.localdomain:06010] ras:alps:allocate: success
[ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
[ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29

====================== ALLOCATED NODES ======================

 Data for node: 29 Num slots: 16 Max slots: 0

=================================================================
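
(Side note: allocation output like the above can usually be reproduced by raising
the verbosity of the ras framework; the level used here is arbitrary:)

$ mpirun --mca ras_base_verbose 100 -np 1 hostname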

Torque also shows only one node with 16 PPN:

$ env | grep PBS
...
PBS_NUM_PPN=16

$ cat /var/spool/torque/aux//915289.sdb
login1
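
For comparison, the nodes file for a correct 2-node, 16-PPN allocation should list
each compute node hostname 16 times, something like (hostnames made up):

nid00028    <- repeated 16 times
nid00029    <- repeated 16 times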

Which is wrong! I will have to ask Cray what is going on here. In the meantime I
recommend switching to msub to get an allocation; Moab has fewer bugs. With this
qsub allocation I can't even get aprun to work:

$ aprun -n 2 -N 1 hostname
apsched: claim exceeds reservation's node-count

$ aprun -n 32 hostname
apsched: claim exceeds reservation's node-count

To get an interactive session with 2 nodes and 16 PPN on each, run:

msub -I -lnodes=2:ppn=16

Open MPI should then work correctly.
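
Once you have that allocation, MCA parameters can be passed either on the mpirun
command line or through OMPI_MCA_* environment variables. A rough sketch
(mpi_show_mca_params is only used here to verify that parameters set in the
environment are being picked up; substitute your own application and parameters):

$ export OMPI_MCA_mpi_show_mca_params=enviro
$ mpirun -np 32 ./your_app

or equivalently:

$ mpirun --mca mpi_show_mca_params enviro -np 32 ./your_app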

-Nathan Hjelm
HPC-5, LANL

On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> Hi,
> I installed Open MPI on our small XE6 using the configure options under
> the /contrib directory. It appears to be working fine, but it ignores MCA
> parameters (set in environment variables). So I switched to mpirun (from Open
> MPI), and it can handle MCA parameters somehow. However, mpirun fails to
> allocate processes by core. For example, I allocated 32 cores (on 2 nodes) with
> "qsub -lmppwidth=32 -lmppnppn=16", but mpirun recognizes it as 2 slots. Is it
> possible for mpirun to handle the multicore nodes of the XE6 properly, or are
> there any options to handle MCA parameters for aprun?
> Regards,
> -----------------------------------------------------------------------------
> Keita Teranishi
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738



