Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [EXTERNAL] Re: (OpenMPI for Cray XE6 ) How to set mca parameters through aprun?
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2013-12-02 17:28:48


Ack, forgot about that. There is a bug in 1.7.3 that breaks one of LANL's default
settings. Just change the line in contrib/platform/lanl/cray_xe6/optimized-common

from:

enable_orte_static_ports=no

to:

enable_orte_static_ports=yes

That should work.
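
For anyone who would rather script that one-line change than edit the file by hand, here is a minimal sketch in Python. It is not part of Open MPI: the helper names enable_static_ports and patch_platform_file are made up for illustration, and it assumes the platform file holds the setting as a plain key=value line exactly as shown above.

```python
import re
from pathlib import Path

# Hypothetical helpers (not part of Open MPI). Assumption: the platform file
# contains the setting as a plain "enable_orte_static_ports=no" line.
_PATTERN = re.compile(
    r"^([ \t]*enable_orte_static_ports[ \t]*=[ \t]*)no[ \t]*$",
    flags=re.MULTILINE,
)


def enable_static_ports(text: str) -> str:
    """Return text with enable_orte_static_ports switched from no to yes."""
    return _PATTERN.sub(r"\1yes", text)


def patch_platform_file(path: str) -> bool:
    """Rewrite the file in place; return True if the setting was changed."""
    platform_file = Path(path)
    original = platform_file.read_text()
    updated = enable_static_ports(original)
    if updated != original:
        platform_file.write_text(updated)
    return updated != original
```

Called as patch_platform_file("contrib/platform/lanl/cray_xe6/optimized-common") from the top of the source tree (an assumed working directory), it returns True if the line was changed. Since the platform file is read at configure time, configure would presumably need to be re-run afterwards.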

-Nathan

On Wed, Nov 27, 2013 at 08:05:48PM +0000, Teranishi, Keita wrote:
> Nathan,
>
> I got a compile-time error (see below). I use a script from
> contrib/platform/lanl/cray_xe6 with gcc-4.7.2. Is there any problem in my
> environment?
>
> Thanks,
> Keita
>
> CC oob_tcp.lo
> oob_tcp.c:353:7: error: expected identifier or '(' before 'else'
> oob_tcp.c:358:5: warning: data definition has no type or storage class [enabled by default]
> oob_tcp.c:358:5: warning: type defaults to 'int' in declaration of 'mca_oob_tcp_ipv4_dynamic_ports' [enabled by default]
> oob_tcp.c:358:5: error: conflicting types for 'mca_oob_tcp_ipv4_dynamic_ports'
> oob_tcp.c:140:14: note: previous definition of 'mca_oob_tcp_ipv4_dynamic_ports' was here
> oob_tcp.c:358:38: warning: initialization makes integer from pointer without a cast [enabled by default]
> oob_tcp.c:359:6: error: expected identifier or '(' before 'void'
> oob_tcp.c:367:5: error: expected identifier or '(' before 'if'
> oob_tcp.c:380:7: error: expected identifier or '(' before 'else'
> oob_tcp.c:384:26: error: expected '=', ',', ';', 'asm' or '__attribute__' before '.' token
> oob_tcp.c:385:30: error: expected declaration specifiers or '...' before string constant
> oob_tcp.c:385:48: error: expected declaration specifiers or '...' before 'disable_family_values'
> oob_tcp.c:385:71: error: expected declaration specifiers or '...' before '&' token
> oob_tcp.c:386:6: error: expected identifier or '(' before 'void'
> oob_tcp.c:391:5: error: expected identifier or '(' before 'do'
> oob_tcp.c:391:5: error: expected identifier or '(' before 'while'
> oob_tcp.c:448:5: error: expected identifier or '(' before 'return'
> oob_tcp.c:449:1: error: expected identifier or '(' before '}' token
> make[2]: *** [oob_tcp.lo] Error 1
> make[2]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte/mca/oob/tcp'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/ufs/home/knteran/openmpi-1.7.3/orte'
>
>
>
>
>
> On 11/26/13 3:54 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
>
> >Alright, everything is identical to Cielito, but it looks like you are
> >getting bad data from alps.
> >
> >I think we changed some of the alps parsing for 1.7.3. Can you give that
> >version a try and let me know if it resolves your issue? If not, I can add
> >better debugging to the ras/alps module.
> >
> >-Nathan
> >
> >On Tue, Nov 26, 2013 at 11:50:00PM +0000, Teranishi, Keita wrote:
> >> Here is what we can see:
> >>
> >> knteran_at_mzlogin01e:~> ls -l /opt/cray/xe-sysroot
> >> total 8
> >> drwxr-xr-x 6 bin bin 4096 2012-02-04 11:05 4.0.36.securitypatch.20111221
> >> drwxr-xr-x 6 bin bin 4096 2013-01-11 15:17 4.1.40
> >> lrwxrwxrwx 1 root root 6 2013-01-11 15:19 default -> 4.1.40
> >>
> >> Thanks,
> >> Keita
> >>
> >>
> >>
> >>
> >> On 11/26/13 3:19 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >>
> >> >??? Alps reports that the two nodes each have one slot. What PE release
> >> >are you using? A quick way to find out is ls -l /opt/cray/xe-sysroot on
> >> >the external login node (this directory does not exist on the internal
> >> >login nodes).
> >> >
> >> >-Nathan
> >> >
> >> >On Tue, Nov 26, 2013 at 11:07:36PM +0000, Teranishi, Keita wrote:
> >> >> Nathan,
> >> >>
> >> >> Here it is.
> >> >>
> >> >> Keita
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 11/26/13 3:02 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >>
> >> >> >Ok, that sheds a little more light on the situation. For some reason
> >> >> >it sees 2 nodes, apparently with one slot each. One more set of
> >> >> >outputs would be helpful. Please run with -mca ras_base_verbose 100 .
> >> >> >That way I can see what was read from alps.
> >> >> >
> >> >> >-Nathan
> >> >> >
> >> >> >On Tue, Nov 26, 2013 at 10:14:11PM +0000, Teranishi, Keita wrote:
> >> >> >> Nathan,
> >> >> >>
> >> >> >> I am hoping these files would help you.
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Keita
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On 11/26/13 1:41 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >> >>
> >> >> >> >Well, no hints as to the error there. Looks identical to the output
> >> >> >> >on my XE-6. How about setting -mca rmaps_base_verbose 100 . See what
> >> >> >> >is going on with the mapper.
> >> >> >> >
> >> >> >> >-Nathan Hjelm
> >> >> >> >Application Readiness, HPC-5, LANL
> >> >> >> >
> >> >> >> >On Tue, Nov 26, 2013 at 09:33:20PM +0000, Teranishi, Keita wrote:
> >> >> >> >> Nathan,
> >> >> >> >>
> >> >> >> >> Please see the attached output obtained from two cases (-np 2 and -np 4).
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >>
> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> --
> >> >> >> >> Keita Teranishi
> >> >> >> >> Principal Member of Technical Staff
> >> >> >> >> Scalable Modeling and Analysis Systems
> >> >> >> >> Sandia National Laboratories
> >> >> >> >> Livermore, CA 94551
> >> >> >> >> +1 (925) 294-3738
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On 11/26/13 1:26 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >> >> >>
> >> >> >> >> >Seems like something is going wrong with processor binding. Can you
> >> >> >> >> >run with -mca plm_base_verbose 100 . Might shed some light on why it
> >> >> >> >> >thinks there are not enough slots.
> >> >> >> >> >
> >> >> >> >> >-Nathan Hjelm
> >> >> >> >> >Application Readiness, HPC-5, LANL
> >> >> >> >> >
> >> >> >> >> >On Tue, Nov 26, 2013 at 09:18:14PM +0000, Teranishi, Keita wrote:
> >> >> >> >> >> Nathan,
> >> >> >> >> >>
> >> >> >> >> >> I have now removed the strip_prefix stuff, which was applied to the
> >> >> >> >> >> other versions of OpenMPI.
> >> >> >> >> >> I still have the same problem with the msubrun command.
> >> >> >> >> >>
> >> >> >> >> >> knteran_at_mzlogin01:~> msub -lnodes=2:ppn=16 -I
> >> >> >> >> >> qsub: waiting for job 7754058.sdb to start
> >> >> >> >> >> qsub: job 7754058.sdb ready
> >> >> >> >> >>
> >> >> >> >> >> knteran_at_mzlogin01:~> cd test-openmpi/
> >> >> >> >> >> knteran_at_mzlogin01:~/test-openmpi> !mp
> >> >> >> >> >> mpicc cpi.c -o cpi
> >> >> >> >> >> knteran_at_mzlogin01:~/test-openmpi> mpirun -np 4 ./cpi
> >> >> >> >> >>
> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> There are not enough slots available in the system to satisfy the 4 slots
> >> >> >> >> >> that were requested by the application:
> >> >> >> >> >> ./cpi
> >> >> >> >> >>
> >> >> >> >> >> Either request fewer slots for your application, or make more slots available
> >> >> >> >> >> for use.
> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >>
> >> >> >> >> >> I set PATH and LD_LIBRARY_PATH to match my own OpenMPI installation.
> >> >> >> >> >> knteran_at_mzlogin01:~/test-openmpi> which mpirun
> >> >> >> >> >> /home/knteran/openmpi/bin/mpirun
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Thanks,
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> --
> >> >> >> >> >> Keita Teranishi
> >> >> >> >> >> Principal Member of Technical Staff
> >> >> >> >> >> Scalable Modeling and Analysis Systems
> >> >> >> >> >> Sandia National Laboratories
> >> >> >> >> >> Livermore, CA 94551
> >> >> >> >> >> +1 (925) 294-3738
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On 11/26/13 12:52 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >> >> >> >>
> >> >> >> >> >> >Weird. That is the same configuration we have deployed on Cielito and
> >> >> >> >> >> >Cielo. Does it work under an msub allocation?
> >> >> >> >> >> >
> >> >> >> >> >> >BTW, with that configuration you should not set
> >> >> >> >> >> >plm_base_strip_prefix_from_node_names to 0. That will confuse orte
> >> >> >> >> >> >since the node hostname will not match what was supplied by alps.
> >> >> >> >> >> >
> >> >> >> >> >> >-Nathan
> >> >> >> >> >> >
> >> >> >> >> >> >On Tue, Nov 26, 2013 at 08:38:51PM +0000, Teranishi, Keita wrote:
> >> >> >> >> >> >> Nathan,
> >> >> >> >> >> >>
> >> >> >> >> >> >> (Please forget about the segfault. It was my mistake).
> >> >> >> >> >> >> I use OpenMPI-1.7.2 (built with gcc-4.7.2) to run the program. I used
> >> >> >> >> >> >> contrib/platform/lanl/cray_xe6/optimized_lustre and
> >> >> >> >> >> >> --enable-mpirun-prefix-by-default for configuration. As I said, it
> >> >> >> >> >> >> works fine with aprun, but fails with mpirun/mpiexec.
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> knteran_at_mzlogin01:~/test-openmpi> ~/openmpi/bin/mpirun -np 4 ./a.out
> >> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> >> There are not enough slots available in the system to satisfy the 4 slots
> >> >> >> >> >> >> that were requested by the application:
> >> >> >> >> >> >> ./a.out
> >> >> >> >> >> >>
> >> >> >> >> >> >> Either request fewer slots for your application, or make more slots available
> >> >> >> >> >> >> for use.
> >> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> >>
> >> >> >> >> >> >> Thanks,
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> >> --
> >> >> >> >> >> >> Keita Teranishi
> >> >> >> >> >> >> Principal Member of Technical Staff
> >> >> >> >> >> >> Scalable Modeling and Analysis Systems
> >> >> >> >> >> >> Sandia National Laboratories
> >> >> >> >> >> >> Livermore, CA 94551
> >> >> >> >> >> >> +1 (925) 294-3738
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> On 11/25/13 12:55 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> >Ok, that should have worked. I just double-checked it to be sure.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$ mpirun -np 32 ./bcast
> >> >> >> >> >> >> >App launch reported: 17 (out of 3) daemons - 0 (out of 32) procs
> >> >> >> >> >> >> >ct-login1:/lscratch1/hjelmn/ibm/collective hjelmn$
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >How did you configure Open MPI and what version are you using?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >-Nathan
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >On Mon, Nov 25, 2013 at 08:48:09PM +0000, Teranishi, Keita wrote:
> >> >> >> >> >> >> >> Hi Nathan,
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I tried the qsub option you suggested:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> mpirun -np 4 --mca plm_base_strip_prefix_from_node_names= 0 ./cpi
> >> >> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> >> >> There are not enough slots available in the system to satisfy the 4 slots
> >> >> >> >> >> >> >> that were requested by the application:
> >> >> >> >> >> >> >> ./cpi
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Either request fewer slots for your application, or make more slots available
> >> >> >> >> >> >> >> for use.
> >> >> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Here is what I got from aprun:
> >> >> >> >> >> >> >> aprun -n 32 ./cpi
> >> >> >> >> >> >> >> Process 8 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 5 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 12 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 9 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 11 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 13 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 0 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 6 of 32 is on nid00011
> >> >> >> >> >> >> >> Process 3 of 32 is on nid00011
> >> >> >> >> >> >> >> :
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> :
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Also, I found a strange error at the end of the program (MPI_Finalize?)
> >> >> >> >> >> >> >> Can you tell me what is wrong with that?
> >> >> >> >> >> >> >> [nid00010:23511] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2aaaacbbb7c0]
> >> >> >> >> >> >> >> [nid00010:23511] [ 1] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_int_free+0x57) [0x2aaaaaf38ec7]
> >> >> >> >> >> >> >> [nid00010:23511] [ 2] /home/knteran/openmpi/lib/libmpi.so.0(opal_memory_ptmalloc2_free+0xc3) [0x2aaaaaf3b6c3]
> >> >> >> >> >> >> >> [nid00010:23511] [ 3] /home/knteran/openmpi/lib/libmpi.so.0(mca_pml_base_close+0xb2) [0x2aaaaae717b2]
> >> >> >> >> >> >> >> [nid00010:23511] [ 4] /home/knteran/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x333) [0x2aaaaad7be23]
> >> >> >> >> >> >> >> [nid00010:23511] [ 5] ./cpi() [0x400e23]
> >> >> >> >> >> >> >> [nid00010:23511] [ 6] /lib64/libc.so.6(__libc_start_main+0xe6) [0x2aaaacde7c36]
> >> >> >> >> >> >> >> [nid00010:23511] [ 7] ./cpi() [0x400b09]
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Thanks,
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> ----------------------------------------------------------------------
> >> >> >> >> >> >> >> --
> >> >> >> >> >> >> >> Keita Teranishi
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Principal Member of Technical Staff
> >> >> >> >> >> >> >> Scalable Modeling and Analysis Systems
> >> >> >> >> >> >> >> Sandia National Laboratories
> >> >> >> >> >> >> >> Livermore, CA 94551
> >> >> >> >> >> >> >> +1 (925) 294-3738
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> On 11/25/13 12:28 PM, "Nathan Hjelm" <hjelmn_at_[hidden]> wrote:
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >Just talked with our local Cray rep. Sounds like that torque syntax is
> >> >> >> >> >> >> >> >broken. You can continue to use qsub (though qsub use is strongly
> >> >> >> >> >> >> >> >discouraged) if you use the msub options.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >Ex:
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >qsub -lnodes=2:ppn=16
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >Works.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >-Nathan
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >On Mon, Nov 25, 2013 at 01:11:29PM -0700, Nathan Hjelm wrote:
> >> >> >> >> >> >> >> >> Hmm, this seems like either a bug in qsub (torque is full of serious
> >> >> >> >> >> >> >> >> bugs) or a bug in alps. I got an allocation using that command and alps
> >> >> >> >> >> >> >> >> only sees 1 node:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_ini
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: parser_separated_columns
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:orte_ras_alps_get_appinfo_attempts: 10
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: begin processing appinfo file
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: 47 entries in file
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3492 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3541 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3560 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3561 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3566 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3573 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3588 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3598 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3599 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3622 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3635 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3640 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3641 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3642 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3647 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3651 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3653 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3659 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3662 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3665 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: read data for resId 3668 - myId 3668
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:read_appinfo(modern): processing NID 29 with 16 slots
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] ras:alps:allocate: success
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert inserting 1 nodes
> >> >> >> >> >> >> >> >> [ct-login1.localdomain:06010] [[15798,0],0] ras:base:node_insert node 29
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> ====================== ALLOCATED NODES ======================
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Data for node: 29 Num slots: 16 Max slots: 0
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> =================================================================
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Torque also shows only one node with 16 PPN:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> $ env | grep PBS
> >> >> >> >> >> >> >> >> ...
> >> >> >> >> >> >> >> >> PBS_NUM_PPN=16
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> $ cat /var/spool/torque/aux//915289.sdb
> >> >> >> >> >> >> >> >> login1
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Which is wrong! I will have to ask Cray what is going on here. I
> >> >> >> >> >> >> >> >> recommend you switch to msub to get an allocation. Moab has fewer bugs. I
> >> >> >> >> >> >> >> >> can't even get aprun to work:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> $ aprun -n 2 -N 1 hostname
> >> >> >> >> >> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> $ aprun -n 32 hostname
> >> >> >> >> >> >> >> >> apsched: claim exceeds reservation's node-count
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> To get an interactive session with 2 nodes and 16 ppn on each, run:
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> msub -I -lnodes=2:ppn=16
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> Open MPI should then work correctly.
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> -Nathan Hjelm
> >> >> >> >> >> >> >> >> HPC-5, LANL
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> On Sat, Nov 23, 2013 at 10:13:26PM +0000, Teranishi, Keita wrote:
> >> >> >> >> >> >> >> >> > Hi,
> >> >> >> >> >> >> >> >> > I installed OpenMPI on our small XE6 using the configure options under
> >> >> >> >> >> >> >> >> > the /contrib directory. It appears it is working fine, but it ignores MCA
> >> >> >> >> >> >> >> >> > parameters (set in env var). So I switched to mpirun (in OpenMPI), and it
> >> >> >> >> >> >> >> >> > can handle MCA parameters somehow. However, mpirun fails to allocate
> >> >> >> >> >> >> >> >> > processes by core. For example, when I allocated 32 cores (on 2 nodes) by
> >> >> >> >> >> >> >> >> > "qsub -lmppwidth=32 -lmppnppn=16", mpirun recognizes it as 2 slots. Is it
> >> >> >> >> >> >> >> >> > possible for mpirun to handle multicore nodes of the XE6 properly, or are
> >> >> >> >> >> >> >> >> > there any options to handle MCA parameters for aprun?
> >> >> >> >> >> >> >> >> > Regards,
> >> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >> > ----------------------------------------------------------------------
> >> >> >> >> >> >> >> >> > Keita Teranishi
> >> >> >> >> >> >> >> >> > Principal Member of Technical Staff
> >> >> >> >> >> >> >> >> > Scalable Modeling and Analysis Systems
> >> >> >> >> >> >> >> >> > Sandia National Laboratories
> >> >> >> >> >> >> >> >> > Livermore, CA 94551
> >> >> >> >> >> >> >> >> > +1 (925) 294-3738
> >> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> >> > _______________________________________________
> >> >> >> >> >> >> >> >> > users mailing list
> >> >> >> >> >> >> >> >> > users_at_[hidden]
> >> >> >> >> >> >> >> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users


