Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2013-11-25 15:55:28


Ok, that should have worked. I just double-checked it to be sure.

ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$

How did you configure Open MPI and what version are you using?

-Nathan
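As an aside, MCA parameters do not have to go on the mpirun command line at all; a minimal sketch, assuming Open MPI's standard OMPI_MCA_<name> environment-variable convention, which aprun-launched binaries also pick up:

```shell
# Open MPI reads any environment variable named OMPI_MCA_<param> as if it
# had been passed with --mca <param> <value>. This is the usual way to set
# MCA parameters when the job is launched by aprun rather than mpirun.
export OMPI_MCA_plm_base_strip_prefix_from_node_names=0
echo "OMPI_MCA_plm_base_strip_prefix_from_node_names=$OMPI_MCA_plm_base_strip_prefix_from_node_names"
```

The same parameter can also be set per-user in ~/.openmpi/mca-params.conf. Note that on the mpirun command line the key and value are two space-separated arguments (--mca plm_base_strip_prefix_from_node_names 0); a stray "=" splits the option, as in the command quoted below.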

On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> Hi Nathan,
>
> I tried the qsub option you suggested:
>
> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
> ./cpi
>
> Either request fewer slots for your application, or make more slots
> available
> for use.
> --------------------------------------------------------------------------
>
>
> Here is what I got from aprun:
> aprun -n 32 ./cpi
> Process 8 of 32 is on nid00011
> Process 5 of 32 is on nid00011
> Process 12 of 32 is on nid00011
> Process 9 of 32 is on nid00011
> Process 11 of 32 is on nid00011
> Process 13 of 32 is on nid00011
> Process 0 of 32 is on nid00011
> Process 6 of 32 is on nid00011
> Process 3 of 32 is on nid00011
> :
>
> :
>
> Also, I found a strange error at the end of the program (in MPI_Finalize?).
> Can you tell me what is wrong with that?
> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> [nid00010:23511] [ 1]
> /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57)
> [0x2aaaaaf38ec7]
> [nid00010:23511] [ 2]
> /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3)
> [0x2aaaaaf3b6c3]
> [nid00010:23511] [ 3]
> /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2)
> [0x2aaaaae717b2]
> [nid00010:23511] [ 4]
> /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333)
> [0x2aaaaad7be23]
> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6)
> [0x2aaaacde7c36]
> [nid00010:23511] [ 7] ./cpi() [0x400b09]
>
>
>
> Thanks,
>
> -----------------------------------------------------------------------------
> Keita Teranishi
>
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
>
>
>
>
>
> On 11/25/13 12:28 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
>
> >Just talked with our local Cray rep. Sounds like that torque syntax is
> >broken. You can continue
> >to use qsub (though qsub use is strongly discouraged) if you use the msub
> >options.
> >
> >Ex:
> >
> >qsub -lnodes=2:ppn=16
> >
> >Works.
> >
> >-Nathan
> >
> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> Hmm, this seems like either a bug in qsub (torque is full of serious
> >>bugs) or a bug
> >> in alps. I got an allocation using that command and alps only sees 1
> >>node:
> >>
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS
> >>configuration file: "/etc/sysconfig/alps"
> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS
> >>configuration file: "/etc/alps.conf"
> >> [ct-login1.localdomain:06010] ras:alps:allocate:
> >>parser_separated_columns
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler
> >>file: "/ufs/alps_shared/appinfo"
> >> [ct-login1.localdomain:06010]
> >>ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing
> >>appinfo file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: file
> >>/ufs/alps_shared/appinfo read
> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3668 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing
> >>NID 29 with 16 slots
> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert
> >>inserting 1 nodes
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >>
> >> ====================== ALLOCATED NODES ======================
> >>
> >> Data for node: 29 Num slots: 16 Max slots: 0
> >>
> >> =================================================================
> >>
> >>
> >> Torque also shows only one node with 16 PPN:
> >>
> >> $ env | grep PBS
> >> ...
> >> PBS_NUM_PPN=16
> >>
> >>
> >> $ cat /var/spool/torque/aux//915289.sdb
> >> login1
> >>
> >> Which is wrong! I will have to ask Cray what is going on here. I
> >>recommend you switch to
> >> msub to get an allocation. Moab has fewer bugs. I can't even get aprun
> >>to work:
> >>
> >> $ aprun -n 2 -N 1 hostname
> >> apsched: claim exceeds reservation's node-count
> >>
> >> $ aprun -n 32 hostname
> >> apsched: claim exceeds reservation's node-count
> >>
> >>
> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >>
> >> msub -I -lnodes=2:ppn=16
> >>
> >> Open MPI should then work correctly.
> >>
> >> -Nathan Hjelm
> >> HPC-5, LANL
> >>
> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> > Hi,
> >> > I installed OpenMPI on our small XE6 using the configure options under
> >> > the /contrib directory. It appears to be working fine, but it ignores
> >> > MCA parameters (set in environment variables). So I switched to mpirun
> >> > (in OpenMPI), and it can handle MCA parameters somehow. However, mpirun
> >> > fails to allocate processes by cores. For example, I allocated 32 cores
> >> > (on 2 nodes) with "qsub -lmppwidth=32 -lmppnppn=16", but mpirun
> >> > recognizes it as 2 slots. Is it possible for mpirun to handle the
> >> > multicore nodes of the XE6 properly, or is there any option to handle
> >> > MCA parameters for aprun?
> >> > Regards,
> >> >
> >> > ---------------------------------------------------------------------------
> >> > Keita Teranishi
> >> > Principal Member of Technical Staff
> >> > Scalable Modeling and Analysis Systems
> >> > Sandia National Laboratories
> >> > Livermore, CA 94551
> >> > +1 (925) 294-3738
> >>
> >> > _______________________________________________
> >> > users mailing list
> >> > users_at_[hidden]
> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> >
> >
> >
>


