Open MPI User's Mailing List Archives

From: Don Kerr (Don.Kerr_at_[hidden])
Date: 2007-08-06 10:08:47


Glenn,

With CT7 there is a utility, orte-clean, that can be used to clean up
leftover cruft from stale MPI processes.

% man -M /opt/SUNWhpc/man -s 1 orte-clean

Warning: this will remove currently running jobs as well. Using "-v" for
verbose output is recommended.
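
For example, a minimal sketch, run on each affected node (the binary path
below assumes the CT7 install prefix reported by ompi_info,
/opt/SUNWhpc/HPC7.0, and should be adjusted if your layout differs):

% /opt/SUNWhpc/HPC7.0/bin/orte-clean -v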

I would be curious if this helps.

-DON
P.S. orte-clean does not exist in the ompi v1.2 branch; it is in the
trunk, but I think there is an issue with it currently.
 
Ralph H Castain wrote:

>
>On 8/5/07 6:35 PM, "Glenn Carver" <Glenn.Carver_at_[hidden]> wrote:
>
>
>
>>I'd appreciate some advice and help on this one. We're having
>>serious problems running parallel applications on our cluster. After
>>each batch job finishes, we lose a certain amount of available
>>memory. Additional jobs cause free memory to gradually go down until
>>the machine starts swapping and becomes unusable or hangs. Taking the
>>machine to single user mode doesn't restore the memory, only a reboot
>>returns all available memory. This happens on all our nodes.
>>
>>We've been doing some testing to try to pin the problems down,
>>although we still don't fully know where the problem is coming from.
>>We have ruled out our applications (Fortran codes); we see the same
>>behaviour with Intel's IMB. We know it's not a network issue, as a
>>parallel job running solely on the 4 cores on each node produces the
>>same effect. All nodes have been brought up to the very latest OS
>>patches and we still see the same problem.
>>
>>Details: we're running Solaris 10/06, Sun Studio 12, Clustertools 7
>>(open-mpi 1.2.1) and Sun Gridengine 6.1. Hardware is Sun X4100/X4200.
>>Kernel version: SunOS 5.10 Generic_125101-10 on all nodes.
>>
>>I read in the release notes that a number of memory leaks were fixed
>>for the 1.2.1 release, but none have been noticed since, so I'm not
>>sure where the problem might be.
>>
>>
>
>I'm not sure where that claim came from, but it is certainly not true that
>we haven't noticed any leaks since 1.2.1. We know we have quite a few memory
>leaks in the code base, many of which are small in themselves but can add up
>depending upon exactly what the application does (i.e., which code paths it
>travels). Running a simple hello_world app under valgrind will show
>significant unreleased memory.
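
A minimal sketch of such a valgrind check, assuming valgrind is installed
and hello_world is your own MPI test binary:

% mpirun -np 2 valgrind --leak-check=full ./hello_world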
>
>I doubt you will see much, if any, improvement in 1.2.4. There have probably
>been a few patches applied, but a comprehensive effort to eradicate the
>problem has not been made. It is something we are trying to clean up over
>time, but it hasn't been a crash priority, as most OSes do a fairly good job of
>cleaning up when the app completes.
>
>
>
>>My next move is to try the very latest release (probably
>>1.2.4 pre-release). As CT7 is built with Sun Studio 11 rather than 12,
>>which we're using, I might also try downgrading. At the moment we're
>>rebooting our cluster nodes every day to keep things going. So any
>>suggestions are appreciated.
>>
>>Thanks, Glenn
>>
>>
>>
>>
>>$ ompi_info
>> Open MPI: 1.2.1r14096-ct7b030r1838
>> Open MPI SVN revision: 0
>> Open RTE: 1.2.1r14096-ct7b030r1838
>> Open RTE SVN revision: 0
>> OPAL: 1.2.1r14096-ct7b030r1838
>> OPAL SVN revision: 0
>> Prefix: /opt/SUNWhpc/HPC7.0
>> Configured architecture: i386-pc-solaris2.10
>> Configured by: root
>> Configured on: Fri Mar 30 13:40:12 EDT 2007
>> Configure host: burpen-csx10-0
>> Built by: root
>> Built on: Fri Mar 30 13:57:25 EDT 2007
>> Built host: burpen-csx10-0
>> C bindings: yes
>> C++ bindings: yes
>> Fortran77 bindings: yes (all)
>> Fortran90 bindings: yes
>> Fortran90 bindings size: trivial
>> C compiler: cc
>> C compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/cc
>> C++ compiler: CC
>> C++ compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/CC
>> Fortran77 compiler: f77
>> Fortran77 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f77
>> Fortran90 compiler: f95
>> Fortran90 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f95
>> C profiling: yes
>> C++ profiling: yes
>> Fortran77 profiling: yes
>> Fortran90 profiling: yes
>> C++ exceptions: yes
>> Thread support: no
>> Internal debug support: no
>> MPI parameter check: runtime
>>Memory profiling support: no
>>Memory debugging support: no
>> libltdl support: yes
>> Heterogeneous support: yes
>> mpirun default --prefix: yes
>> MCA backtrace: printstack (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA timer: solaris (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA coll: self (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA io: romio (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA mpool: udapl (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.1)
>> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.1)
>> MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
>> MCA btl: udapl (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.1)
>> MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.1)
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.1)
>> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.1)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
>> MCA sds: env (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.1)
>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.1)