Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2013-11-25 15:55:28


Ok, that should have worked. I just double-checked it to be sure.

ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
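
Incidentally, since the subject of this thread is how to set MCA parameters: the two common ways to set an Open MPI MCA parameter are on the mpirun command line or through an OMPI_MCA_-prefixed environment variable. A rough sketch, reusing the parameter name from the command quoted below (the value 0 is just an example):

# command-line form
mpirun -np 4 --mca plm_base_strip_prefix_from_node_names 0 ./cpi

# environment-variable form (each MCA parameter maps to an
# OMPI_MCA_<parameter_name> variable)
export OMPI_MCA_plm_base_strip_prefix_from_node_names=0
mpirun -np 4 ./cpi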

How did you configure Open MPI and what version are you using?

-Nathan

On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> Hi Nathan,
>
> I tried the qsub option you suggested:
>
> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 4 slots
> that were requested by the application:
> ./cpi
>
> Either request fewer slots for your application, or make more slots
> available
> for use.
> --------------------------------------------------------------------------
>
>
> Here is what I got from aprun:
> aprun -n 32 ./cpi
> Process 8 of 32 is on nid00011
> Process 5 of 32 is on nid00011
> Process 12 of 32 is on nid00011
> Process 9 of 32 is on nid00011
> Process 11 of 32 is on nid00011
> Process 13 of 32 is on nid00011
> Process 0 of 32 is on nid00011
> Process 6 of 32 is on nid00011
> Process 3 of 32 is on nid00011
> :
>
> :
>
> Also, I found a strange error at the end of the program (MPI_Finalize?).
> Can you tell me what is wrong with that?
> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> [nid00010:23511] [ 1]
> /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57)
> [0x2aaaaaf38ec7]
> [nid00010:23511] [ 2]
> /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3)
> [0x2aaaaaf3b6c3]
> [nid00010:23511] [ 3]
> /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2)
> [0x2aaaaae717b2]
> [nid00010:23511] [ 4]
> /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333)
> [0x2aaaaad7be23]
> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6)
> [0x2aaaacde7c36]
> [nid00010:23511] [ 7] ./cpi() [0x400b09]
>
>
>
> Thanks,
>
> -----------------------------------------------------------------------------
> Keita Teranishi
>
> Principal Member of Technical Staff
> Scalable Modeling and Analysis Systems
> Sandia National Laboratories
> Livermore, CA 94551
> +1 (925) 294-3738
>
>
>
>
>
> On 11/25/13 12:28 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
>
> >Just talked with our local Cray rep. Sounds like that torque syntax is
> >broken. You can continue
> >to use qsub (though qsub use is strongly discouraged) if you use the msub
> >options.
> >
> >Ex:
> >
> >qsub -lnodes=2:ppn=16
> >
> >Works.
> >
> >-Nathan
> >
> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> Hmm, this seems like either a bug in qsub (torque is full of serious
> >>bugs) or a bug
> >> in alps. I got an allocation using that command and alps only sees 1
> >>node:
> >>
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS
> >>configuration file: "/etc/sysconfig/alps"
> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS
> >>configuration file: "/etc/alps.conf"
> >> [ct-login1.localdomain:06010] ras:alps:allocate:
> >>parser_separated_columns
> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler
> >>file: "/ufs/alps_shared/appinfo"
> >> [ct-login1.localdomain:06010]
> >>ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing
> >>appinfo file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: file
> >>/ufs/alps_shared/appinfo read
> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3492 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3541 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3560 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3561 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3566 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3573 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3588 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3598 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3599 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3622 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3635 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3640 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3641 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3642 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3647 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3651 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3653 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3659 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3662 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3665 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId
> >>3668 - myId 3668
> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing
> >>NID 29 with 16 slots
> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert
> >>inserting 1 nodes
> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >>
> >> ====================== ALLOCATED NODES ======================
> >>
> >> Data for node: 29 Num slots: 16 Max slots: 0
> >>
> >> =================================================================
> >>
> >>
> >> Torque also shows only one node with 16 PPN:
> >>
> >> $ env | grep PBS
> >> ...
> >> PBS_NUM_PPN=16
> >>
> >>
> >> $ cat /var/spool/torque/aux//915289.sdb
> >> login1
> >>
> >> Which is wrong! I will have to ask Cray what is going on here. I
> >>recommend you switch to
> >> msub to get an allocation. Moab has fewer bugs. I can't even get aprun
> >>to work:
> >>
> >> $ aprun -n 2 -N 1 hostname
> >> apsched: claim exceeds reservation's node-count
> >>
> >> $ aprun -n 32 hostname
> >> apsched: claim exceeds reservation's node-count
> >>
> >>
> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >>
> >> msub -I -lnodes=2:ppn=16
> >>
> >> Open MPI should then work correctly.
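> >>
> >> For example, the whole sequence would look something like this (adjust
> >> -np to match your allocation; 2 nodes x 16 ppn gives 32 slots):
> >>
> >> $ msub -I -lnodes=2:ppn=16
> >> # then, inside the interactive job:
> >> $ mpirun -np 32 ./cpi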
> >>
> >> -Nathan Hjelm
> >> HPC-5, LANL
> >>
> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> > Hi,
> >> > I installed Open MPI on our small XE6 using the configure options
> >> > under the /contrib directory. It appears to be working fine, but it
> >> > ignores MCA parameters (set in environment variables). So I switched
> >> > to mpirun (in Open MPI), which can handle MCA parameters somehow.
> >> > However, mpirun fails to allocate processes by cores. For example, I
> >> > allocated 32 cores (on 2 nodes) with "qsub -lmppwidth=32
> >> > -lmppnppn=16", but mpirun recognizes it as only 2 slots. Is it
> >> > possible for mpirun to handle the multicore nodes of the XE6
> >> > properly, or are there any options to handle MCA parameters for
> >> > aprun?
> >> > Regards,
> >> >
> >> > -----------------------------------------------------------------------------
> >> > Keita Teranishi
> >> > Principal Member of Technical Staff
> >> > Scalable Modeling and Analysis Systems
> >> > Sandia National Laboratories
> >> > Livermore, CA 94551
> >> > +1 (925) 294-3738
> >>
> >
> >
> >
> >
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users


