Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?
From: Teranishi, Keita (knteran_at_[hidden])
Date: 2013-11-25 15:48:09


Hi Nathan,

I tried the qsub option you suggested, but mpirun still fails:

mpirun -np 4 --mca plm_base_strip_prefix_from_node_names 0 ./cpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  ./cpi

Either request fewer slots for your application, or make more slots
available
for use.
--------------------------------------------------------------------------
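As an aside on the original question about passing MCA parameters through aprun: Open MPI also reads MCA parameters from environment variables of the form `OMPI_MCA_<param_name>`, which exported shell variables pass through to aprun-launched processes. A minimal sketch (the aprun line is illustrative only and assumes a Cray login node with an allocation):

```shell
# Open MPI picks up MCA parameters from environment variables named
# OMPI_MCA_<param_name>; aprun-launched processes inherit exported values.
export OMPI_MCA_plm_base_strip_prefix_from_node_names=0
echo "$OMPI_MCA_plm_base_strip_prefix_from_node_names"
# Then launch as usual on the compute nodes, e.g.:
#   aprun -n 32 ./cpi
```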

Here is what I got from aprun:
aprun -n 32 ./cpi
Process 8 of 32 is on nid00011
Process 5 of 32 is on nid00011
Process 12 of 32 is on nid00011
Process 9 of 32 is on nid00011
Process 11 of 32 is on nid00011
Process 13 of 32 is on nid00011
Process 0 of 32 is on nid00011
Process 6 of 32 is on nid00011
Process 3 of 32 is on nid00011
:

:

Also, I found a strange error at the end of the program (in MPI_Finalize?).
Can you tell me what is wrong?
[nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
[nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
[nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
[nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
[nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
[nid00010:23511] [ 5] ./cpi() [0x400e23]
[nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
[nid00010:23511] [ 7] ./cpi() [0x400b09]

Thanks,

---------------------------------------------------------------------------

--
Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738
On 11/25/13 12:28 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
>Just talked with our local Cray rep. Sounds like that torque syntax is
>broken. You can continue to use qsub (though qsub use is strongly
>discouraged) if you use the msub options.
>
>Ex:
>
>qsub -lnodes=2:ppn=16
>
>Works.
>
>-Nathan
>
>On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
>> Hmm, this seems like either a bug in qsub (torque is full of serious
>> bugs) or a bug in alps. I got an allocation using that command, and
>> alps only sees 1 node:
>> 
>> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
>> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
>> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
>> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
>> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
>> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
>> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
>> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
>> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
>> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
>> [ct-login1.localdomain:06010] ras:alps:allocate: success
>> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
>> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
>> 
>> ======================   ALLOCATED NODES   ======================
>> 
>>  Data for node: 29	Num slots: 16	Max slots: 0
>> 
>> =================================================================
>> 
>> 
>> Torque also shows only one node with 16 PPN:
>> 
>> $ env | grep PBS
>> ...
>> PBS_NUM_PPN=16
>> 
>> 
>> $ cat /var/spool/torque/aux//915289.sdb
>> login1
>> 
>> Which is wrong! I will have to ask Cray what is going on here. I
>> recommend you switch to msub to get an allocation. Moab has fewer
>> bugs. I can't even get aprun to work:
>> 
>> $ aprun -n 2 -N 1 hostname
>> apsched: claim exceeds reservation's node-count
>> 
>> $ aprun -n 32 hostname
>> apsched: claim exceeds reservation's node-count
>> 
>> 
>> To get an interactive session with 2 nodes and 16 ppn on each, run:
>> 
>> msub -I -lnodes=2:ppn=16
>> 
>> Open MPI should then work correctly.
>> 
>> -Nathan Hjelm
>> HPC-5, LANL
>> 
>> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
>> >    Hi,
>> >    I installed OpenMPI on our small XE6 using the configure options
>> >    under the /contrib directory.  It appears to be working fine, but it
>> >    ignores MCA parameters (set in environment variables).  So I switched
>> >    to mpirun (in OpenMPI), and it can handle MCA parameters somehow.
>> >    However, mpirun fails to allocate processes by core.  For example,
>> >    when I allocated 32 cores (on 2 nodes) by "qsub -lmppwidth=32
>> >    -lmppnppn=16", mpirun recognized it as only 2 slots.  Is it possible
>> >    for mpirun to handle the multicore nodes of the XE6 properly, or are
>> >    there any options to handle MCA parameters for aprun?
>> >    Regards,
>> >    
>> >    ---------------------------------------------------------------------------
>> >    Keita Teranishi
>> >    Principal Member of Technical Staff
>> >    Scalable Modeling and Analysis Systems
>> >    Sandia National Laboratories
>> >    Livermore, CA 94551
>> >    +1 (925) 294-3738
>> 
>> > _______________________________________________
>> > users mailing list
>> > users_at_[hidden]
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 