Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling
From: Eloi Gaudry (eg_at_[hidden])
Date: 2010-05-21 11:19:35


Hi Reuti,

Yes, the openmpi binaries in use were built with the --with-sge option passed
to configure, and we only use those binaries on our cluster.
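
For reference, a minimal sketch of the build (the prefix matches the ompi_info
output below; any other configure flags we used are omitted here):

   ./configure --prefix=/opt/openmpi-1.3.3 --with-sge
   make all install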

[eg_at_moe:~]$ /opt/openmpi-1.3.3/bin/ompi_info
                 Package: Open MPI root_at_moe Distribution
                Open MPI: 1.3.3
   Open MPI SVN revision: r21666
   Open MPI release date: Jul 14, 2009
                Open RTE: 1.3.3
   Open RTE SVN revision: r21666
   Open RTE release date: Jul 14, 2009
                    OPAL: 1.3.3
       OPAL SVN revision: r21666
       OPAL release date: Jul 14, 2009
            Ident string: 1.3.3
                  Prefix: /opt/openmpi-1.3.3
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: moe
           Configured by: root
           Configured on: Tue Nov 10 11:19:34 CET 2009
          Configure host: moe
                Built by: root
                Built on: Tue Nov 10 11:28:14 CET 2009
              Built host: moe
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: yes
          Thread support: posix (mpi: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no (checkpoint thread: no)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
              MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
           MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
                  MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: gm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
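
The SGE-related components show up in that list (e.g. the "MCA ras: gridengine"
entry); a quicker sanity check than scanning the full output would be something
like:

   /opt/openmpi-1.3.3/bin/ompi_info | grep gridengine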

Regards,
Eloi

On Friday 21 May 2010 16:01:54 Reuti wrote:
> Hi,
>
> On 21.05.2010, at 14:11, Eloi Gaudry wrote:
> > Hi there,
> >
> > I'm observing something strange on our cluster, managed by SGE 6.2u4, when
> > launching a parallel computation on several nodes using the OpenMPI/SGE
> > tight-integration mode (OpenMPI-1.3.3). It seems that the SGE-allocated
> > slots are not used by OpenMPI, as if OpenMPI was doing its own
> > round-robin allocation based on the allocated node hostnames.
>
> did you compile Open MPI with --with-sge (and recompile your applications)?
> Are you using the correct mpiexec?
>
> -- Reuti
>
> > Here is what I'm doing:
> > - launch a parallel computation involving 8 processors, each of them using
> > 14GB of memory. I'm using a qsub command where I request the memory_free
> > resource and use tight integration with openmpi (a sketch of such a
> > submission follows the list below)
> > - 3 servers are available:
> > . barney with 4 cores (4 slots) and 32GB
> > . carl with 4 cores (4 slots) and 32GB
> > . charlie with 8 cores (8 slots) and 64GB
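> >
> > A minimal sketch of such a submission, assuming the round_robin PE shown
> > further below and a placeholder job script (the exact memory_free syntax
> > depends on how the resource is configured):
> >
> >   qsub -pe round_robin 8 -l memory_free=14G ./run_solver.sh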
> >
> > Here is the output of the allocated nodes (OpenMPI output):
> > ====================== ALLOCATED NODES ======================
> >
> > Data for node: Name: charlie Launch id: -1 Arch: ffc91200 State: 2
> >
> > Daemon: [[44332,0],0] Daemon launched: True
> > Num slots: 4 Slots in use: 0
> > Num slots allocated: 4 Max slots: 0
> > Username on node: NULL
> > Num procs: 0 Next node_rank: 0
> >
> > Data for node: Name: carl.fft Launch id: -1 Arch: 0 State: 2
> >
> > Daemon: Not defined Daemon launched: False
> > Num slots: 2 Slots in use: 0
> > Num slots allocated: 2 Max slots: 0
> > Username on node: NULL
> > Num procs: 0 Next node_rank: 0
> >
> > Data for node: Name: barney.fft Launch id: -1 Arch: 0 State: 2
> >
> > Daemon: Not defined Daemon launched: False
> > Num slots: 2 Slots in use: 0
> > Num slots allocated: 2 Max slots: 0
> > Username on node: NULL
> > Num procs: 0 Next node_rank: 0
> >
> > =================================================================
> >
> > Here is what I see when my computation is running on the cluster:
> > # rank pid hostname
> >
> > 0 28112 charlie
> > 1 11417 carl
> > 2 11808 barney
> > 3 28113 charlie
> > 4 11418 carl
> > 5 11809 barney
> > 6 28114 charlie
> > 7 11419 carl
> >
> > Note that the parallel environment used under SGE is defined as:
> > [eg_at_moe:~]$ qconf -sp round_robin
> > pe_name round_robin
> > slots 32
> > user_lists NONE
> > xuser_lists NONE
> > start_proc_args /bin/true
> > stop_proc_args /bin/true
> > allocation_rule $round_robin
> > control_slaves TRUE
> > job_is_first_task FALSE
> > urgency_slots min
> > accounting_summary FALSE
> >
> > I'm wondering why OpenMPI didn't use the allocation chosen by SGE
> > (cf. the "ALLOCATED NODES" report above) but instead placed the processes of
> > the parallel computation one node at a time, using a round-robin method.
> >
> > Note that I'm using the '--bynode' option on the orterun command line. If
> > the behavior I'm observing is simply a consequence of using this option,
> > please let me know. That would mean that the SGE tight integration has a
> > lower priority in orterun than the command-line options.
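> >
> > For reference, the launch line looks roughly like this (the solver binary
> > name is only a placeholder):
> >
> >   /opt/openmpi-1.3.3/bin/orterun --bynode -np 8 ./solver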
> >
> > Any help would be appreciated,
> > Thanks,
> > Eloi
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Eloi Gaudry
Free Field Technologies
Axis Park Louvain-la-Neuve
Rue Emile Francqui, 1
B-1435 Mont-Saint Guibert
BELGIUM
Company Phone: +32 10 487 959
Company Fax:   +32 10 454 626