Open MPI User's Mailing List Archives

From: Glenn Carver (Glenn.Carver_at_[hidden])
Date: 2007-08-05 20:35:12


I'd appreciate some advice and help on this one. We're having
serious problems running parallel applications on our cluster. After
each batch job finishes, we lose a certain amount of available
memory. Additional jobs cause free memory to gradually go down until
the machine starts swapping and becomes unusable or hangs. Taking the
machine to single-user mode doesn't restore the memory; only a reboot
returns all available memory. This happens on all our nodes.

We've been doing some testing to try to pin the problems down,
although we still don't fully know where the problem is coming from.
We have ruled out our applications (Fortran codes); we see the same
behaviour with Intel's IMB. We know it's not a network issue, as a
parallel job running solely on the 4 cores on each node produces the
same effect. All nodes have been brought up to the very latest OS
patches and we still see the same problem.
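
A minimal sketch of how the per-job loss described above could be quantified, assuming a free-memory reading (in MB) is taken on an idle node after each job, for example from the free column of vmstat. The function names and the readings are hypothetical and are not measurements from this cluster.

```python
def memory_lost_per_job(free_mb_readings):
    """Return the drop in free memory (MB) between consecutive readings."""
    return [before - after
            for before, after in zip(free_mb_readings, free_mb_readings[1:])]


def jobs_until_exhausted(free_mb_now, avg_loss_mb):
    """Estimate how many more jobs fit before free memory reaches zero."""
    if avg_loss_mb <= 0:
        return None  # no measurable loss, so no meaningful estimate
    return int(free_mb_now // avg_loss_mb)


# Hypothetical readings (MB), one taken on an idle node after each job.
readings = [7800, 7550, 7310, 7050]
losses = memory_lost_per_job(readings)
avg_loss = sum(losses) / len(losses)

print(losses)
print(jobs_until_exhausted(readings[-1], avg_loss))
```

A roughly constant loss per job would point at something left behind by every run, while a loss that scales with process count or message size would point at per-process or per-buffer resources.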

Details: we're running Solaris 10/06, Sun Studio 12, ClusterTools 7
(Open MPI 1.2.1) and Sun Grid Engine 6.1. Hardware is Sun X4100/X4200.
Kernel version: SunOS 5.10 Generic_125101-10 on all nodes.

I read in the release notes that a number of memory leaks were fixed
for the 1.2.1 release, but none have been noticed since, so I'm not
sure where the problem might be.

My next move is to try the very latest release (probably a 1.2.4
pre-release). As CT7 is built with Sun Studio 11 rather than the
Sun Studio 12 we're using, I might also try downgrading. At the moment
we're rebooting our cluster nodes every day to keep things going, so
any suggestions are appreciated.

Thanks, Glenn

$ ompi_info
                 Open MPI: 1.2.1r14096-ct7b030r1838
    Open MPI SVN revision: 0
                 Open RTE: 1.2.1r14096-ct7b030r1838
    Open RTE SVN revision: 0
                     OPAL: 1.2.1r14096-ct7b030r1838
        OPAL SVN revision: 0
                   Prefix: /opt/SUNWhpc/HPC7.0
  Configured architecture: i386-pc-solaris2.10
            Configured by: root
            Configured on: Fri Mar 30 13:40:12 EDT 2007
           Configure host: burpen-csx10-0
                 Built by: root
                 Built on: Fri Mar 30 13:57:25 EDT 2007
               Built host: burpen-csx10-0
               C bindings: yes
             C++ bindings: yes
       Fortran77 bindings: yes (all)
       Fortran90 bindings: yes
  Fortran90 bindings size: trivial
               C compiler: cc
      C compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/cc
             C++ compiler: CC
    C++ compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/CC
       Fortran77 compiler: f77
   Fortran77 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f77
       Fortran90 compiler: f95
   Fortran90 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f95
              C profiling: yes
            C++ profiling: yes
      Fortran77 profiling: yes
      Fortran90 profiling: yes
           C++ exceptions: yes
           Thread support: no
   Internal debug support: no
      MPI parameter check: runtime
 Memory profiling support: no
 Memory debugging support: no
          libltdl support: yes
    Heterogeneous support: yes
  mpirun default --prefix: yes
            MCA backtrace: printstack (MCA v1.0, API v1.0, Component v1.2.1)
            MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.2.1)
            MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.1)
                MCA timer: solaris (MCA v1.0, API v1.0, Component v1.2.1)
            MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
            MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                 MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA coll: self (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.1)
                   MCA io: romio (MCA v1.0, API v1.0, Component v1.2.1)
                MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.1)
                MCA mpool: udapl (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.1)
               MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2.1)
               MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.1)
                  MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.1)
                  MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
                  MCA btl: udapl (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.1)
               MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.1)
               MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.1)
               MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.1)
                   MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.1)
                   MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.1)
                  MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                  MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.1)
                MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.1)
                 MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.1)
                 MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.1)
                  MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA sds: env (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.1)
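
When comparing builds across nodes or releases, it can help to pull individual fields out of this listing rather than reading it by eye. The following is a rough sketch that assumes the output keeps the "label: value" shape shown above; `parse_ompi_info` is a hypothetical helper, not part of Open MPI, and the sample text is copied from the listing. The SOS11 directory in the compiler paths appears consistent with the Sun Studio 11 build mentioned in the message.

```python
def parse_ompi_info(text):
    """Map each 'label: value' line of ompi_info output to a dict entry.

    Every label maps to a list, so labels that repeat (such as 'MCA btl')
    keep all of their values.
    """
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'label: value' shape
        fields.setdefault(label.strip(), []).append(value.strip())
    return fields


# A few lines copied from the listing above.
sample = """\
                 Open MPI: 1.2.1r14096-ct7b030r1838
      C compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/cc
   Fortran90 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f95
                  MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.1)
                  MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.1)
"""

info = parse_ompi_info(sample)
print(info["C compiler absolute"])
print([value.split()[0] for value in info["MCA btl"]])
```

Splitting on the first colon only keeps values that contain colons themselves, such as the build timestamps, intact.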