
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-08-07 08:29:01


I guess this is a question for Sun: what happens if registered memory
is not freed after a process exits? Does the kernel leave it allocated?

On Aug 6, 2007, at 7:00 PM, Glenn Carver wrote:

> Just to clarify, the MPI applications exit cleanly. We have our own
> f90 code (in various configurations) and I'm also testing using
> Intel's IMB. If I watch the applications whilst they run, there is a
> drop in free memory as our code begins, the free memory then steadily
> drops as the code runs. When it exits normally, free memory increases
> but falls short of where it was before the code started. The longer
> we run the code for, the bigger the final drop in memory. Taking the
> machine down to single-user mode doesn't help, so it's not an issue of
> processes still running. Neither can I find any files still open with
> lsof.
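>
> (For reference, I'm watching memory roughly like this on Solaris 10; the
> interval is arbitrary, and ::memstat has to be run as root:
>
>    $ vmstat 5                  # watch the free-memory column over time
>    # echo ::memstat | mdb -k   # kernel / anon / page-cache / free breakdown
> )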
>
> We installed Sun's Clustertools 6 (not based on Open MPI) and we don't
> see the same problem. I'm currently testing whether setting
> btl_udapl_flags=1 makes a difference. I'm guessing that registered
> memory is leaking? We're also trying some MCA parameters to turn off
> features we don't need, to see if that makes a difference. I'll
> report back on point 2 below and on further tests later. If there are
> specific MCA parameters you'd like me to try, let me know.
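>
> (That setting goes on the mpirun command line; a rough sketch, where the
> process count and executable name are just placeholders:
>
>    $ mpirun --mca btl_udapl_flags 1 -np 4 ./my_mpi_app
>    $ mpirun --mca btl self,sm,tcp -np 4 ./my_mpi_app   # skip the udapl BTL entirely
> )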
>
> Thanks, Glenn
>
>
>> Guess I don't see how stale shared memory files would cause swapping to
>> occur. Besides, the user provided no indication that the applications
>> were terminating abnormally, which makes it likely we cleaned up the
>> session directories as we should.
>>
>> However, we definitely leak memory (i.e., we don't free all memory we
>> malloc while supporting execution of an application), so if the OS isn't
>> cleaning up after us, it is quite possible we are causing the problem.
>> It would appear exactly as described - a slow leak that gradually builds
>> up until the "dead" area is so big that it forces applications to swap
>> to find enough room to work.
>>
>> So I guess we should ask for clarification:
>>
>> 1. Are the Open MPI applications exiting cleanly? Do you see any stale
>> "orted" executables still in the process table?
>>
>> 2. Can you check the temp directory where we would be operating? This is
>> usually your /tmp directory, unless you specified some other location.
>> Look for our session directories - they have a name that includes
>> "openmpi" in them. Are they being cleaned up (i.e., removed) when the
>> applications exit?
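>>
>> Something along these lines should show whether they are being left
>> behind (the exact directory names include your username and hostname,
>> so the pattern below is only approximate):
>>
>>    $ ls -ld /tmp/openmpi-sessions-*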
>>
>> Thanks
>> Ralph
>>
>>
>> On 8/6/07 5:53 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>
>>> Unless there's something weird going on in the Solaris kernel, the only
>>> memory that we should be leaking after MPI processes exit would be
>>> shared memory files that are [somehow] not getting removed properly.
>>>
>>> Right?
>>>
>>>
>>> On Aug 6, 2007, at 8:15 AM, Ralph H Castain wrote:
>>>
>>>> Hmmm... just to clarify, as I think there may be some confusion here.
>>>>
>>>> Orte-clean will kill any outstanding Open MPI daemons (which should kill
>>>> their local apps) and will clean up their associated temporary file
>>>> systems. If you are having problems with zombied processes or stale
>>>> daemons, then this will hopefully help (it isn't perfect, but it helps).
>>>>
>>>> However, orte-clean will not do anything about releasing memory that has
>>>> been "leaked" by Open MPI. We don't have any tools for doing that, I'm
>>>> afraid.
>>>>
>>>>
>>>> On 8/6/07 8:08 AM, "Don Kerr" <Don.Kerr_at_[hidden]> wrote:
>>>>
>>>>> Glenn,
>>>>>
>>>>> With CT7 there is a utility which can be used to clean up leftover
>>>>> cruft from stale MPI processes.
>>>>>
>>>>> % man -M /opt/SUNWhpc/man -s 1 orte-clean
>>>>>
>>>>> Achtung: this will remove currently running jobs as well. Use of "-v"
>>>>> for verbose output is recommended.
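>>>>>
>>>>> A rough invocation sketch, assuming the CT7 install prefix shown in the
>>>>> ompi_info output below (adjust the path to your installation):
>>>>>
>>>>>    % /opt/SUNWhpc/HPC7.0/bin/orte-clean -v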
>>>>>
>>>>> I would be curious if this helps.
>>>>>
>>>>> -DON
>>>>> P.S. orte-clean does not exist in the OMPI v1.2 branch; it is in the
>>>>> trunk, but I think there is an issue with it currently.
>>>>>
>>>>> Ralph H Castain wrote:
>>>>>
>>>>>>
>>>>>> On 8/5/07 6:35 PM, "Glenn Carver" <Glenn.Carver_at_[hidden]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I'd appreciate some advice and help on this one. We're having
>>>>>>> serious problems running parallel applications on our cluster. After
>>>>>>> each batch job finishes, we lose a certain amount of available
>>>>>>> memory. Additional jobs cause free memory to gradually go down until
>>>>>>> the machine starts swapping and becomes unusable or hangs. Taking the
>>>>>>> machine to single-user mode doesn't restore the memory; only a reboot
>>>>>>> returns all available memory. This happens on all our nodes.
>>>>>>>
>>>>>>> We've been doing some testing to try to pin the problems down,
>>>>>>> although we still don't fully know where the problem is coming from.
>>>>>>> We have ruled out our applications (Fortran codes); we see the same
>>>>>>> behaviour with Intel's IMB. We know it's not a network issue, as a
>>>>>>> parallel job running solely on the 4 cores on each node produces the
>>>>>>> same effect. All nodes have been brought up to the very latest OS
>>>>>>> patches and we still see the same problem.
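>>>>>>>
>>>>>>> (For reference, the single-node runs look roughly like this; the
>>>>>>> benchmark binary and host name are placeholders, and restricting the
>>>>>>> BTLs to shared memory is optional but keeps the network out entirely:
>>>>>>>
>>>>>>>    $ mpirun -np 4 -host node01 --mca btl self,sm ./IMB-MPI1
>>>>>>> )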
>>>>>>>
>>>>>>> Details: we're running Solaris 10/06, Sun Studio 12, Clustertools 7
>>>>>>> (Open MPI 1.2.1) and Sun Grid Engine 6.1. Hardware is Sun X4100/X4200.
>>>>>>> Kernel version: SunOS 5.10 Generic_125101-10 on all nodes.
>>>>>>>
>>>>>>> I read in the release notes that a number of memory leaks were fixed
>>>>>>> for the 1.2.1 release, but none have been noticed since, so I'm not
>>>>>>> sure where the problem might be.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I'm not sure where that claim came from, but it is certainly not true
>>>>>> that we haven't noticed any leaks since 1.2.1. We know we have quite a
>>>>>> few memory leaks in the code base, many of which are small in
>>>>>> themselves but can add up depending upon exactly what the application
>>>>>> does (i.e., which code paths it travels). Running a simple hello_world
>>>>>> app under valgrind will show significant unreleased memory.
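>>>>>>
>>>>>> On a platform where valgrind is available, that check is roughly:
>>>>>>
>>>>>>    $ mpirun -np 2 valgrind --leak-check=full ./hello_world
>>>>>>
>>>>>> where hello_world is any trivial MPI program that just calls MPI_Init
>>>>>> and MPI_Finalize; the "definitely lost" and "still reachable" totals at
>>>>>> exit show the unreleased memory.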
>>>>>>
>>>>>> I doubt you will see much, if any, improvement in 1.2.4. There have
>>>>>> probably been a few patches applied, but a comprehensive effort to
>>>>>> eradicate the problem has not been made. It is something we are trying
>>>>>> to clean up over time, but it hasn't been a crash priority, as most
>>>>>> OSes do a fairly good job of cleaning up when the app completes.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> My next move is to try the very latest release (probably the 1.2.4
>>>>>>> pre-release). As CT7 is built with Sun Studio 11 rather than the
>>>>>>> Studio 12 we're using, I might also try downgrading. At the moment
>>>>>>> we're rebooting our cluster nodes every day to keep things going, so
>>>>>>> any suggestions are appreciated.
>>>>>>>
>>>>>>> Thanks, Glenn
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> $ ompi_info
>>>>>>> Open MPI: 1.2.1r14096-ct7b030r1838
>>>>>>> Open MPI SVN revision: 0
>>>>>>> Open RTE: 1.2.1r14096-ct7b030r1838
>>>>>>> Open RTE SVN revision: 0
>>>>>>> OPAL: 1.2.1r14096-ct7b030r1838
>>>>>>> OPAL SVN revision: 0
>>>>>>> Prefix: /opt/SUNWhpc/HPC7.0
>>>>>>> Configured architecture: i386-pc-solaris2.10
>>>>>>> Configured by: root
>>>>>>> Configured on: Fri Mar 30 13:40:12 EDT 2007
>>>>>>> Configure host: burpen-csx10-0
>>>>>>> Built by: root
>>>>>>> Built on: Fri Mar 30 13:57:25 EDT 2007
>>>>>>> Built host: burpen-csx10-0
>>>>>>> C bindings: yes
>>>>>>> C++ bindings: yes
>>>>>>> Fortran77 bindings: yes (all)
>>>>>>> Fortran90 bindings: yes
>>>>>>> Fortran90 bindings size: trivial
>>>>>>> C compiler: cc
>>>>>>> C compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/cc
>>>>>>> C++ compiler: CC
>>>>>>> C++ compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/CC
>>>>>>> Fortran77 compiler: f77
>>>>>>> Fortran77 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f77
>>>>>>> Fortran90 compiler: f95
>>>>>>> Fortran90 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f95
>>>>>>> C profiling: yes
>>>>>>> C++ profiling: yes
>>>>>>> Fortran77 profiling: yes
>>>>>>> Fortran90 profiling: yes
>>>>>>> C++ exceptions: yes
>>>>>>> Thread support: no
>>>>>>> Internal debug support: no
>>>>>>> MPI parameter check: runtime
>>>>>>> Memory profiling support: no
>>>>>>> Memory debugging support: no
>>>>>>> libltdl support: yes
>>>>>>> Heterogeneous support: yes
>>>>>>> mpirun default --prefix: yes
>>>>>>> MCA backtrace: printstack (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA paffinity: solaris (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA timer: solaris (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>>>>>>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>>>>>>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA coll: self (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA io: romio (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA mpool: udapl (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.1)
>>>>>>> MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.1)
>>>>>>> MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
>>>>>>> MCA btl: udapl (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.1)
>>>>>>> MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.1)
>>>>>>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>>>>>>> MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.1)
>>>>>>> MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.1)
>>>>>>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>> MCA sds: env (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>

-- 
Jeff Squyres
Cisco Systems