Open MPI User's Mailing List Archives

From: Glenn Carver (Glenn.Carver_at_[hidden])
Date: 2007-08-07 16:15:34


Don,

Following up on this, here are the results of the tests. All is well
until udapl is included. No MCA parameters are set in these jobs. As
I reported to you before, if I add --mca btl_udapl_flags=1, the
memory problem goes away.
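
For reference, a quick sketch of the ways that parameter can be set; any of
these should be equivalent (I'm assuming the usual Open MPI default location
for the per-user params file):

   $ mpirun --mca btl_udapl_flags 1 -np 16 ./IMB-MPI1.ct7.studio12 ...   # per run
   $ OMPI_MCA_btl_udapl_flags=1; export OMPI_MCA_btl_udapl_flags         # per shell
   $ echo "btl_udapl_flags = 1" >> ~/.openmpi/mca-params.conf            # per user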

The batch jobs run vmstat before and after the mpirun command. Here's
the relevant part of the batch output from the three tests. The
problem shows up as the difference in the 'free' column reported by
vmstat before and after mpirun. You'll notice a drop of about 145 MB
in the 'btl self,sm,tcp,udapl' case.

Regards, Glenn.

======== btl self,tcp
+ vmstat 3 3
  kthr memory page disk faults cpu
  r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
  0 0 0 6923680 2189060 6 97 5 0 0 0 15 0 0 0 1 3809 369393 2324 27 10 62
  0 0 0 6803144 1964320 1 22 0 0 0 0 0 0 0 0 0 587 388 184 0 0 100
  0 0 0 6803112 1964292 0 0 0 0 0 0 0 0 0 0 0 442 329 144 0 0 100
+ mpirun --mca btl self,tcp -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
+ vmstat 3 3
  kthr memory page disk faults cpu
  r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
  0 0 0 6780740 2144660 6 98 5 0 0 0 14 0 0 0 1 5145 455335 3147 27 14 59
  0 0 0 6799020 1959984 3 31 0 0 0 0 0 0 0 0 0 640 358 268 0 0 100
  0 0 0 6799012 1959980 0 0 0 0 0 0 0 0 0 0 0 432 305 128 0 0 100

========== btl self,sm,tcp
+ vmstat 3 3
  kthr memory page disk faults cpu
  r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
  0 0 0 9038736 2291420 7 107 6 0 0 0 20 0 0 0 1 2445 164773 1373 28 7 65
  0 0 0 9084592 2149496 1 22 0 0 0 0 0 0 0 0 0 537 343 170 0 0 100
  0 0 0 9084580 2149488 0 0 0 0 0 0 0 0 0 0 0 527 357 168 0 0 100
+ mpirun --mca btl self,sm,tcp -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
+ vmstat 3 3
  kthr memory page disk faults cpu
  r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
  0 0 0 8879504 2239168 7 106 6 0 0 0 18 0 0 0 1 4205 416635 2470 29 12 60
  0 0 0 9079008 2143824 3 32 0 0 0 0 0 0 0 0 0 648 358 279 0 0 100
  0 0 0 9079000 2143820 0 0 0 0 0 0 0 0 0 0 0 433 327 133 0 0 100

========= btl self,sm,tcp,udapl
+ vmstat 3 3
  kthr memory page disk faults cpu
  r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
  0 0 0 6771044 2134784 6 101 5 0 0 0 14 0 0 0 1 5060 447191 3094 28 14 58
  0 0 0 6799340 1960104 1 22 0 0 0 0 0 0 0 0 0 538 320 164 0 0 100
  0 0 0 6799328 1960096 0 0 0 0 0 0 0 0 0 0 0 439 321 139 0 0 100
+ mpirun --mca btl self,sm,tcp,udapl -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
+ vmstat 3 3
  kthr memory page disk faults cpu
  r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
  0 0 0 6726824 2120420 6 105 4 0 0 0 13 0 0 0 1 4967 438387 3035 29 14 57
  0 0 0 6654032 1814788 3 31 0 0 0 0 0 0 0 0 0 656 457 284 0 0 100
  0 0 0 6654024 1814784 0 0 0 0 0 0 0 0 0 0 0 453 336 146 0 0 100
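
For the udapl case the drop works out from the 'free' column (KB): 1960096
before mpirun vs 1814784 after, i.e. about 145000 KB, the ~145 MB mentioned
above. A minimal sketch of how the batch script could print that delta
directly, assuming vmstat keeps the layout shown above (free is column 5):

   before=`vmstat 3 3 | tail -1 | awk '{print $5}'`
   mpirun --mca btl self,sm,tcp,udapl -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
   after=`vmstat 3 3 | tail -1 | awk '{print $5}'`
   echo "free memory dropped by `expr $before - $after` KB"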

>Glenn,
>
>While I look into the possibility of registered memory not being freed
>could you run your same tests but without shared memory or udapl.
>
>"--mca btl self,tcp"
>
>If this is successful, i.e. frees memory as expected, the next step
>would be to run including shared memory, "--mca btl self,sm,tcp". If
>this is successful, the last step would be to add in udapl, "--mca btl
>self,sm,udapl".
>
>-DON
>
>Glenn Carver wrote:
>
>>Just to clarify, the MPI applications exit cleanly. We have our own
>>f90 code (in various configurations) and I'm also testing using
>>Intel's IMB. If I watch the applications whilst they run, there is a
>>drop in free memory as our code begins, the free memory then steadily
>>drops as the code runs. When it exits normally, free memory increases
>>but falls short of where it was before the code started. The longer
>>we run the code, the bigger the final drop in memory. Taking the
>>machine down to single user mode doesn't help so it's not an issue of
>>processes still running. Neither can I find any files still open with
>>lsof.
>>
>>We installed Sun's ClusterTools 6 (not based on Open MPI) and we don't
>>see the same problem. I'm currently testing whether setting
>>btl_udapl_flags=1 makes a difference. I'm guessing that registered
>>memory is leaking? We're also trying some MCA parameters to turn off
>>features we don't need, to see if that makes a difference. I'll
>>report back on point 2 below and further tests later. If there are
>>specific MCA parameters you'd like me to set, let me know.
>>
>>Thanks, Glenn
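
On the MCA-parameter question above: the udapl-related parameters can be
listed with ompi_info, which is an easy way to see what can be turned off.
A sketch (exact parameter names will vary by release):

   $ ompi_info --param btl udapl
   $ ompi_info --param mpool udapl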
>>
>>
>>
>>
>>>Guess I don't see how stale shared memory files would cause swapping to
>>>occur. Besides, the user provided no indication that the applications were
>>>abnormally terminating, which makes it likely we cleaned up the session
>>>directories as we should.
>>>
>>>However, we definitely leak memory (i.e., we don't free all memory we malloc
>>>while supporting execution of an application), so if the OS isn't cleaning
>>>up after us, it is quite possible we could be causing the problem as
>>>described. It would appear exactly as described - a slow leak that gradually
>>>builds up until the "dead" area is so big that it forces applications to
>>>swap to find enough room to work.
>>>
>>>So I guess we should ask for clarification:
>>>
>>>1. are the Open MPI applications exiting cleanly? Do you see any stale
>>>"orted" executables still in the process table?
>>>
>>>2. can you check the temp directory where we would be operating? This is
>>>usually your /tmp directory, unless you specified some other location. Look
>>>for our session directories - they have a name that includes "openmpi" in
>>>them. Are they being cleaned up (i.e., removed) when the applications exit?
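
Both checks are quick to run on the nodes; a sketch, assuming the default
/tmp location for the session directories:

   $ ps -ef | grep orted | grep -v grep     # any stale Open MPI daemons left?
   $ ls -ld /tmp/*openmpi*                  # any session directories left behind?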
>>>
>>>Thanks
>>>Ralph
>>>
>>>
>>>On 8/6/07 5:53 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>
>>>
>>>
>>>> Unless there's something weird going on in the Solaris kernel, the
>>>> only memory that we should be leaking after MPI processes exit would
>>>> be shared memory files that are [somehow] not getting removed properly.
>>>>
>>>> Right?
>>>>
>>>>
>>>> On Aug 6, 2007, at 8:15 AM, Ralph H Castain wrote:
>>>>
>>>>
>>>>
>>>>> Hmmm...just to clarify as I think there may be some confusion here.
>>>>>
>>>>> Orte-clean will kill any outstanding Open MPI daemons (which should
>>>>> kill
>>>>> their local apps) and will cleanup their associated temporary file
>>>>> systems.
>>>>> If you are having problems with zombied processes or stale daemons,
>>>>> then
>>>>> this will hopefully help (it isn't perfect, but it helps).
>>>>>
>>>>> However, orte-clean will not do anything about releasing memory
>>>>> that has
>>>>> been "leaked" by Open MPI. We don't have any tools for doing that, I'm
>>>>> afraid.
>>>>>
>>>>>
>>>>> On 8/6/07 8:08 AM, "Don Kerr" <Don.Kerr_at_[hidden]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Glenn,
>>>>>>
>>>>>> With CT7 there is a utility which can be used to clean up left over
>>>>>> cruft from stale MPI processes.
>>>>>>
>>>>>> % man -M /opt/SUNWhpc/man -s 1 orte-clean
>>>>>>
>>>>>> Warning: this will remove currently running jobs as well. Use of "-v"
>>>>>> for verbose is recommended.
>>>>>>
>>>>>> I would be curious if this helps.
>>>>>>
>>>>>> -DON
>>>>>> p.s. orte-clean does not exist in the ompi v1.2 branch; it is in the
>>>>>> trunk, but I think there is an issue with it currently.
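
Invocation is straightforward; a sketch, assuming orte-clean is on the PATH
of the CT7 install (and remembering it also kills jobs that are still running):

   $ orte-clean -v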
>>>>>>
>>>>>> Ralph H Castain wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 8/5/07 6:35 PM, "Glenn Carver" <Glenn.Carver_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> I'd appreciate some advice and help on this one. We're having
>>>>>>>> serious problems running parallel applications on our cluster. After
>>>>>>>> each batch job finishes, we lose a certain amount of available
>>>>>>>> memory. Additional jobs cause free memory to gradually go down
>>>>>>>> until
>>>>>>>> the machine starts swapping and becomes unusable or hangs.
>>>>>>>> Taking the
>>>>>>>> machine to single user mode doesn't restore the memory, only a
>>>>>>>> reboot
>>>>>>>> returns all available memory. This happens on all our nodes.
>>>>>>>>
>>>>>>>> We've been doing some testing to try to pin the problems down,
>>>>>>>> although we still don't fully know where the problem is coming
>>>>>>>> from.
>>>>>>>> We have ruled out our applications (fortran codes); we see the same
>>>>>>>> behaviour with Intel's IMB. We know it's not a network issue as a
>>>>>>>> parallel job running solely on the 4 cores on each node produces
>>>>>>>> the
>>>>>>>> same effect. All nodes have been brought up to the very latest OS
>>>>>>>> patches and we still see the same problem.
>>>>>>>>
>>>>>>>> Details: we're running Solaris 10/06, Sun Studio 12, Clustertools 7
>>>>>>>> (Open MPI 1.2.1) and Sun Grid Engine 6.1. Hardware is Sun X4100/
>>>>>>>> X4200.
>>>>>>>> Kernel version: SunOS 5.10 Generic_125101-10 on all nodes.
>>>>>>>>
>>>>>>>> I read in the release notes that a number of memory leaks were
>>>>>>>> fixed
>>>>>>>> for the 1.2.1 release, but none have been noticed since, so I'm not
>>>>>>>> sure where the problem might be.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> I'm not sure where that claim came from, but it is certainly not
>>>>>>> true that
>>>>>>> we haven't noticed any leaks since 1.2.1. We know we have quite a
>>>>>>> few memory
>>>>>>> leaks in the code base, many of which are small in themselves but
>>>>>>> can add up
>>>>>>> depending upon exactly what the application does (i.e., which
>>>>>>> code paths it
>>>>>>> travels). Running a simple hello_world app under valgrind will show
>>>>>>> significant unreleased memory.
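
For anyone who wants to reproduce that check, a minimal sketch (hello.c being
any trivial MPI_Init/MPI_Finalize program; valgrind may be easier to run on a
Linux box than on Solaris):

   $ mpicc hello.c -o hello
   $ mpirun -np 2 valgrind --leak-check=full ./hello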
>>>>>>>
>>>>>>> I doubt you will see much, if any, improvement in 1.2.4. There
>>>>>>> have probably
>>>>>>> been a few patches applied, but a comprehensive effort to
>>>>>>> eradicate the
>>>>>>> problem has not been made. It is something we are trying to
>>>>>>> clean up over
>>>>>>> time, but it hasn't been a crash priority as most OSes do a fairly
>>>>>>> good job of
>>>>>>> cleaning up when the app completes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> My next move is to try the very latest release (probably
>>>>>>>> 1.2.4 pre-release). As CT7 is built with Sun Studio 11 rather than 12
>>>>>>>> which we're using, I might also try downgrading. At the moment
>>>>>>>> we're
>>>>>>>> rebooting our cluster nodes every day to keep things going. So any
>>>>>>>> suggestions are appreciated.
>>>>>>>>
>>>>>>>> Thanks, Glenn
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> $ ompi_info
>>>>>>>> Open MPI: 1.2.1r14096-ct7b030r1838
>>>>>>>> Open MPI SVN revision: 0
>>>>>>>> Open RTE: 1.2.1r14096-ct7b030r1838
>>>>>>>> Open RTE SVN revision: 0
>>>>>>>> OPAL: 1.2.1r14096-ct7b030r1838
>>>>>>>> OPAL SVN revision: 0
>>>>>>>> Prefix: /opt/SUNWhpc/HPC7.0
>>>>>>>> Configured architecture: i386-pc-solaris2.10
>>>>>>>> Configured by: root
>>>>>>>> Configured on: Fri Mar 30 13:40:12 EDT 2007
>>>>>>>> Configure host: burpen-csx10-0
>>>>>>>> Built by: root
>>>>>>>> Built on: Fri Mar 30 13:57:25 EDT 2007
>>>>>>>> Built host: burpen-csx10-0
>>>>>>>> C bindings: yes
>>>>>>>> C++ bindings: yes
>>>>>>>> Fortran77 bindings: yes (all)
>>>>>>>> Fortran90 bindings: yes
>>>>>>>> Fortran90 bindings size: trivial
>>>>>>>> C compiler: cc
>>>>>>>> C compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/cc
>>>>>>>> C++ compiler: CC
>>>>>>>> C++ compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/CC
>>>>>>>> Fortran77 compiler: f77
>>>>>>>> Fortran77 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f77
>>>>>>>> Fortran90 compiler: f95
>>>>>>>> Fortran90 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f95
>>>>>>>> C profiling: yes
>>>>>>>> C++ profiling: yes
>>>>>>>> Fortran77 profiling: yes
>>>>>>>> Fortran90 profiling: yes
>>>>>>>> C++ exceptions: yes
>>>>>>>> Thread support: no
>>>>>>>> Internal debug support: no
>>>>>>>> MPI parameter check: runtime
>>>>>>>> Memory profiling support: no
>>>>>>>> Memory debugging support: no
>>>>>>>> libltdl support: yes
>>>>>>>> Heterogeneous support: yes
>>>>>>>> mpirun default --prefix: yes
>>>>>>>> MCA backtrace: printstack (MCA v1.0, API v1.0,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA paffinity: solaris (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA maffinity: first_use (MCA v1.0, API v1.0,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA timer: solaris (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA allocator: basic (MCA v1.0, API v1.0, Component
>>>>>>>> v1.0)
>>>>>>>> MCA allocator: bucket (MCA v1.0, API v1.0, Component
>>>>>>>> v1.0)
>>>>>>>> MCA coll: basic (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA coll: self (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>>> MCA coll: tuned (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA io: romio (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>>> MCA mpool: udapl (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>>> MCA pml: ob1 (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>>> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2.1)
>>>>>>>> MCA rcache: vma (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA btl: self (MCA v1.0, API v1.0.1, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA btl: sm (MCA v1.0, API v1.0.1, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA btl: tcp (MCA v1.0, API v1.0.1, Component
>>>>>>>> v1.0)
>>>>>>>> MCA btl: udapl (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA topo: unity (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA osc: pt2pt (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA errmgr: hnp (MCA v1.0, API v1.3, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA errmgr: orted (MCA v1.0, API v1.3, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA errmgr: proxy (MCA v1.0, API v1.3, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA gpr: null (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA gpr: proxy (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA gpr: replica (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA iof: proxy (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA iof: svc (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA ns: proxy (MCA v1.0, API v2.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA ns: replica (MCA v1.0, API v2.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>>>>>>>> MCA ras: dash_host (MCA v1.0, API v1.3,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA ras: gridengine (MCA v1.0, API v1.3,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA ras: localhost (MCA v1.0, API v1.3,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>>> MCA rds: hostfile (MCA v1.0, API v1.3,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA rds: proxy (MCA v1.0, API v1.3, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA rds: resfile (MCA v1.0, API v1.3, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA rmaps: round_robin (MCA v1.0, API v1.3,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA rmgr: proxy (MCA v1.0, API v2.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA rmgr: urm (MCA v1.0, API v2.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA rml: oob (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA pls: gridengine (MCA v1.0, API v1.3,
>>>>>>>> Component v1.2.1)
>>>>>>>> MCA pls: proxy (MCA v1.0, API v1.3, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA pls: rsh (MCA v1.0, API v1.3, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
>>>>>>>> MCA sds: env (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA sds: pipe (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA sds: seed (MCA v1.0, API v1.0, Component
>>>>>>>> v1.2.1)
>>>>>>>> MCA sds: singleton (MCA v1.0, API v1.0,
>>>>>>>> Component v1.2.1)
>_______________________________________________
>users mailing list
>users_at_[hidden]
>http://www.open-mpi.org/mailman/listinfo.cgi/users