On Aug 24, 2007, at 4:18 PM, Josh Aune wrote:
> We are using open-mpi on several 1000+ node clusters. We received
> several new clusters using the Infiniserve 3.X software stack recently
> and are having several problems with the vapi btl (yes, I know, it is
> very very old and shouldn't be used. I couldn't agree with you more
> but those are my marching orders).
Thankfully, Infiniserve is not within my purview. But -- FWIW -- you
should be using OFED. :-) (I know you know)
> I have a new application that is running into swap for an unknown
> reason. If I run and force it to use the tcp btl I don't seem to run
> into swap (the job just takes a very very long time). I have tried
> restricting the size of the free lists, forcing to use send mode, and
> use an open-mpi compiled w/ no memory manager but nothing seems to
> help. I've profiled with valgrind --tool=massif and the memtrace
> capabilities of ptmalloc but I don't have any smoking guns yet. It is
> a Fortran app and I don't know anything about debugging Fortran memory
> problems, can someone point me in the proper direction?
Hmm. If you compile Open MPI with no memory manager, then it
*shouldn't* be Open MPI's fault (unless there's a leak in the mvapi
BTL...?). Verify that you did not actually compile Open MPI with a
memory manager by running "ompi_info | grep ptmalloc2" -- it should
come up empty.
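
For scripting that check across several installs, here is a minimal sketch. The helper name check_memory_manager is made up for illustration; it only wraps the grep above and reads ompi_info output on stdin.

```shell
# Hypothetical helper (not part of Open MPI): reads ompi_info output on
# stdin and reports whether the ptmalloc2 memory manager shows up in it.
check_memory_manager() {
    if grep -q ptmalloc2; then
        echo "ptmalloc2 memory manager is compiled in"
    else
        echo "no ptmalloc2 memory manager found"
    fi
}

# Usage on a real install:
#   ompi_info | check_memory_manager
```

If it reports that ptmalloc2 is compiled in, the build you are running is not the no-memory-manager build you meant to test.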
The fact that you can run this under TCP without memory leaking would
seem to indicate that it's not the app that's leaking memory, but
rather either the MPI or the network stack.
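
To put numbers on that comparison, one option is to sample the resident memory of a single rank under each BTL and compare the growth. This is a rough sketch that assumes Linux /proc; the helper name rss_kb is hypothetical, and selecting the BTL with "--mca btl tcp,self" versus "--mca btl mvapi,self" is my assumption about how you launch the runs.

```shell
# Hypothetical helper: print the resident set size (in kB) recorded in a
# /proc/<pid>/status-style file given as the first argument.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# Example sampling loop for one rank's pid (Linux only), run once per BTL:
#   while kill -0 "$pid" 2>/dev/null; do
#       rss_kb "/proc/$pid/status"
#       sleep 5
#   done
```

If the numbers climb steadily under mvapi but stay flat under TCP for the same input, that points at the MPI or network stack rather than the application.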