Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2013-11-26 18:02:31


Ok, that sheds a little more light on the situation. For some reason it sees 2 nodes,
apparently with one slot each. One more set of outputs would be helpful. Please run
with -mca ras_base_verbose 100. That way I can see what was read from ALPS.
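
For example (reusing the cpi test program from your earlier runs):

  mpirun -np 4 -mca ras_base_verbose 100 ./cpi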

-Nathan

On Tue, Nov 26, 2013 at 10:14:11PM +0000, Teranishi, Keita wrote:
> Nathan,
>
> I am hoping these files would help you.
>
> Thanks,
> Keita
>
>
>
> On 11/26/13 1:41 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
>
> >Well, no hints as to the error there. Looks identical to the output on my
> >XE-6. How about setting -mca rmaps_base_verbose 100. See what is going on
> >with the mapper.
> >
> >-Nathan Hjelm
> >Application Readiness, HPC-5, LANL
> >
> >On Tue, Nov 26, 2013 at 09:33:20PM +0000, Teranishi, Keita wrote:
> >> Nathan,
> >>
> >> Please see the attached files, obtained from two cases (-np 2 and -np 4).
> >>
> >> Thanks,
> >>
> >> -----------------------------------------------------------------------------
> >> Keita Teranishi
> >> Principal Member of Technical Staff
> >> Scalable Modeling and Analysis Systems
> >> Sandia National Laboratories
> >> Livermore, CA 94551
> >> +1 (925) 294-3738
> >>
> >>
> >>
> >>
> >>
> >> On 11/26/13 1:26 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >>
> >> >Seems like something is going wrong with processor binding. Can you run
> >> >with -mca plm_base_verbose 100. Might shed some light on why it thinks
> >> >there are not enough slots.
> >> >
> >> >-Nathan Hjelm
> >> >Application Readiness, HPC-5, LANL
> >> >
> >> >On Tue, Nov 26, 2013 at 09:18:14PM +0000, Teranishi, Keita wrote:
> >> >> Nathan,
> >> >>
> >> >> Now I removed the strip_prefix stuff, which was applied to the other
> >> >> versions of OpenMPI.
> >> >> I still have the same problem with mpirun under the msub allocation.
> >> >>
> >> >> knteran_at_mzlogin01:~> msub -lnodes=2:ppn=16 -I
> >> >> qsub: waiting for job 7754058.sdb to start
> >> >> qsub: job 7754058.sdb ready
> >> >>
> >> >> knteran_at_mzlogin01:~> cd test-openmpi/
> >> >> knteran_at_mzlogin01:~/test-openmpi> !mp
> >> >> mpicc cpi.c -o cpi
> >> >> knteran_at_mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
> >> >>
> >> >>
> >> >> --------------------------------------------------------------------------
> >> >> There are not enough slots available in the system to satisfy the 4
> >> >> slots that were requested by the application:
> >> >>   ./cpi
> >> >>
> >> >> Either request fewer slots for your application, or make more slots
> >> >> available for use.
> >> >> --------------------------------------------------------------------------
> >> >>
> >> >> I set PATH and LD_LIBRARY_PATH to match my own OpenMPI installation.
> >> >> knteran_at_mzlogin01:~/test-openmpi> which mpirun
> >> >> /home/knteran/openmpi/bin/mpirun
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> Thanks,
> >> >>
> >> >>
> >> >> -----------------------------------------------------------------------------
> >> >> Keita Teranishi
> >> >> Principal Member of Technical Staff
> >> >> Scalable Modeling and Analysis Systems
> >> >> Sandia National Laboratories
> >> >> Livermore, CA 94551
> >> >> +1 (925) 294-3738
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 11/26/13 12:52 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >>
> >> >> >Weird. That is the same configuration we have deployed on Cielito and
> >> >> >Cielo. Does it work under an msub allocation?
> >> >> >
> >> >> >BTW, with that configuration you should not set
> >> >> >plm_base_strip_prefix_from_node_names to 0. That will confuse orte since
> >> >> >the node hostname will not match what was supplied by alps.
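> >> >> >
> >> >> >(If you want to double-check what it is currently set to, something like
> >> >> >"ompi_info --all | grep strip_prefix" should show it.)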
> >> >> >
> >> >> >-Nathan
> >> >> >
> >> >> >On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:
> >> >> >> Nathan,
> >> >> >>
> >> >> >> (Please forget about the segfault. It was my mistake).
> >> >> >> I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program. I used
> >> >> >> contrib/platform/lanl/cray_xe6/optimized_lustre and
> >> >> >> --enable-mpirun-prefix-by-default for configuration. As I said, it
> >> >> >> works fine with aprun, but fails with mpirun/mpiexec.
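> >> >> >> (For reference, the configure line was roughly:
> >> >> >>   ./configure --with-platform=contrib/platform/lanl/cray_xe6/optimized_lustre \
> >> >> >>     --enable-mpirun-prefix-by-default )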
> >> >> >>
> >> >> >>
> >> >> >> knteran_at_mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
> >> >> >>
> >> >> >> --------------------------------------------------------------------------
> >> >> >> There are not enough slots available in the system to satisfy the 4
> >> >> >> slots that were requested by the application:
> >> >> >>   ./a.out
> >> >> >>
> >> >> >> Either request fewer slots for your application, or make more slots
> >> >> >> available for use.
> >> >> >> --------------------------------------------------------------------------
> >> >> >>
> >> >> >> Thanks,
> >> >> >>
> >> >> >>
> >> >> >> -----------------------------------------------------------------------------
> >> >> >> Keita Teranishi
> >> >> >> Principal Member of Technical Staff
> >> >> >> Scalable Modeling and Analysis Systems
> >> >> >> Sandia National Laboratories
> >> >> >> Livermore, CA 94551
> >> >> >> +1 (925) 294-3738
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On 11/25/13 12:55 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >> >>
> >> >> >> >Ok, that should have worked. I just double-checked it to be sure.
> >> >> >> >
> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
> >> >> >> >App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
> >> >> >> >
> >> >> >> >
> >> >> >> >How did you configure Open MPI and what version are you using?
> >> >> >> >
> >> >> >> >-Nathan
> >> >> >> >
> >> >> >> >On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> >> >> >> >> Hi Nathan,
> >> >> >> >>
> >> >> >> >> I tried the qsub option you suggested:
> >> >> >> >>
> >> >> >> >> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
> >> >> >> >>
> >> >> >> >> --------------------------------------------------------------------------
> >> >> >> >> There are not enough slots available in the system to satisfy the 4
> >> >> >> >> slots that were requested by the application:
> >> >> >> >>   ./cpi
> >> >> >> >>
> >> >> >> >> Either request fewer slots for your application, or make more slots
> >> >> >> >> available for use.
> >> >> >> >> --------------------------------------------------------------------------
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Here is what I got from aprun:
> >> >> >> >> aprun -n 32 ./cpi
> >> >> >> >> Process 8 of 32 is on nid00011
> >> >> >> >> Process 5 of 32 is on nid00011
> >> >> >> >> Process 12 of 32 is on nid00011
> >> >> >> >> Process 9 of 32 is on nid00011
> >> >> >> >> Process 11 of 32 is on nid00011
> >> >> >> >> Process 13 of 32 is on nid00011
> >> >> >> >> Process 0 of 32 is on nid00011
> >> >> >> >> Process 6 of 32 is on nid00011
> >> >> >> >> Process 3 of 32 is on nid00011
> >> >> >> >> :
> >> >> >> >>
> >> >> >> >> :
> >> >> >> >>
> >> >> >> >> Also, I found a strange error at the end of the program (MPI_Finalize?).
> >> >> >> >> Can you tell me what is wrong with that?
> >> >> >> >> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> >> >> >> >> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
> >> >> >> >> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
> >> >> >> >> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
> >> >> >> >> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
> >> >> >> >> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> >> >> >> >> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
> >> >> >> >> [nid00010:23511] [ 7] ./cpi() [0x400b09]
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >>
> >> >> >> >> -----------------------------------------------------------------------------
> >> >> >> >> Keita Teranishi
> >> >> >> >>
> >> >> >> >> Principal Member of Technical Staff
> >> >> >> >> Scalable Modeling and Analysis Systems
> >> >> >> >> Sandia National Laboratories
> >> >> >> >> Livermore, CA 94551
> >> >> >> >> +1 (925) 294-3738
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On 11/25/13 12:28 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >> >> >>
> >> >> >> >> >Just talked with our local Cray rep. Sounds like that torque syntax
> >> >> >> >> >is broken. You can continue to use qsub (though qsub use is strongly
> >> >> >> >> >discouraged) if you use the msub options.
> >> >> >> >> >
> >> >> >> >> >Ex:
> >> >> >> >> >
> >> >> >> >> >qsub -lnodes=2:ppn=16
> >> >> >> >> >
> >> >> >> >> >Works.
> >> >> >> >> >
> >> >> >> >> >-Nathan
> >> >> >> >> >
> >> >> >> >> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> >> >> >> >> Hmm, this seems like either a bug in qsub (torque is full of serious
> >> >> >> >> >> bugs) or a bug in alps. I got an allocation using that command and
> >> >> >> >> >> alps only sees 1 node:
> >> >> >> >> >>
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >> >> >> >> >>
> >> >> >> >> >> ====================== ALLOCATED NODES ======================
> >> >> >> >> >>
> >> >> >> >> >>  Data for node: 29   Num slots: 16   Max slots: 0
> >> >> >> >> >>
> >> >> >> >> >> =================================================================
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Torque also shows only one node with 16 PPN:
> >> >> >> >> >>
> >> >> >> >> >> $ env | grep PBS
> >> >> >> >> >> ...
> >> >> >> >> >> PBS_NUM_PPN=16
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> $ cat /var/spool/torque/aux//915289.sdb
> >> >> >> >> >> login1
> >> >> >> >> >>
> >> >> >> >> >> Which is wrong! I will have to ask Cray what is going on here. I
> >> >> >> >> >> recommend you switch to msub to get an allocation. Moab has fewer
> >> >> >> >> >> bugs. I can't even get aprun to work:
> >> >> >> >> >>
> >> >> >> >> >> $ aprun -n 2 -N 1 hostname
> >> >> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >> >> >>
> >> >> >> >> >> $ aprun -n 32 hostname
> >> >> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >> >> >> >> >>
> >> >> >> >> >> msub -I -lnodes=2:ppn=16
> >> >> >> >> >>
> >> >> >> >> >> Open MPI should then work correctly.
> >> >> >> >> >>
> >> >> >> >> >> -Nathan Hjelm
> >> >> >> >> >> HPC-5, LANL
> >> >> >> >> >>
> >> >> >> >> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> >> >> >> >> >    Hi,
> >> >> >> >> >> >    I installed OpenMPI on our small XE6 using the configure options
> >> >> >> >> >> >    under the /contrib directory. It appears it is working fine, but
> >> >> >> >> >> >    it ignores MCA parameters (set in env vars). So I switched to
> >> >> >> >> >> >    mpirun (in OpenMPI), and it can handle MCA parameters somehow.
> >> >> >> >> >> >    However, mpirun fails to allocate processes by cores. For
> >> >> >> >> >> >    example, I allocated 32 cores (on 2 nodes) by "qsub -lmppwidth=32
> >> >> >> >> >> >    -lmppnppn=16", but mpirun recognizes it as 2 slots. Is it
> >> >> >> >> >> >    possible for mpirun to handle multicore nodes of the XE6
> >> >> >> >> >> >    properly, or is there any option to handle MCA parameters for
> >> >> >> >> >> >    aprun?
> >> >> >> >> >> > Regards,
> >> >> >> >> >> >
> >> >> >> >> >> > -----------------------------------------------------------------------------
> >> >> >> >> >> > Keita Teranishi
> >> >> >> >> >> > Principal Member of Technical Staff
> >> >> >> >> >> > Scalable Modeling and Analysis Systems
> >> >> >> >> >> > Sandia National Laboratories
> >> >> >> >> >> > Livermore, CA 94551
> >> >> >> >> >> > +1 (925) 294-3738
> >> >> >> >> >>
> >> >> >> >> >> > _______________________________________________
> >> >> >> >> >> > users mailing list
> >> >> >> >> >> > users_at_[hidden]
> >> >> >> >> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >> _______________________________________________
> >> >> >> >> >> users mailing list
> >> >> >> >> >> users_at_[hidden]
> >> >> >> >> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> _______________________________________________
> >> >> >> >> users mailing list
> >> >> >> >> users_at_[hidden]
> >> >> >> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >> >>
> >> >> >> _______________________________________________
> >> >> >> users mailing list
> >> >> >> users_at_[hidden]
> >> >> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >>
> >> >> _______________________________________________
> >> >> users mailing list
> >> >> users_at_[hidden]
> >> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >
> >
> >
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>

> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users


