We are using Open MPI on several 1000+ node clusters. We recently
received several new clusters running the Infiniserve 3.X software
stack and are having several problems with the vapi btl (yes, I know
it is very, very old and shouldn't be used; I couldn't agree with you
more, but those are my marching orders).
I have a new application that is running into swap for an unknown
reason. If I force it to use the tcp btl, I don't seem to run into
swap (the job just takes a very, very long time). I have tried
restricting the size of the free lists, forcing send mode, and using
an Open MPI build compiled without the memory manager, but nothing
seems to help. I've profiled with valgrind --tool=massif and the
memtrace capabilities of ptmalloc, but I don't have any smoking guns
yet. It is a Fortran app and I don't know anything about debugging
Fortran memory problems. Can someone point me in the right direction?
Thanks,
Josh