As mentioned in today's telecon, we at LANL are continuing to see hangs when
running even small jobs that involve shared memory in collective operations.
This has been the topic of discussion before, but I bring it up again
because (a) the problem is beginning to become epidemic across our
application codes, and (b) repeated testing provides more info and (most
importantly) confirms that this problem -does not- occur under 1.2.x - it is
strictly a 1.3.2 problem (we haven't checked whether it also occurs in 1.3.0
or 1.3.1).
The condition is caused when the application performs a loop over collective
operations such as MPI_Allgather, MPI_Reduce, and MPI_Bcast. This list is
not intended to be exhaustive, but only represents the ones for which we
have solid and repeatable data. The symptoms are a "hanging" job, typically
(but not always!) associated with fully-consumed memory. The loops do not
have to involve substantial amounts of memory (the Bcast loop hangs after
moving only 32 MBytes in total), nor do they need high loop counts. They only
have to repeatedly call the collective.
Disabling the shared memory BTL is enough to completely resolve the problem.
However, this creates an undesirable performance penalty we would like to
avoid, if possible.
Our current solution is to use the "sync" collective to occasionally insert
an MPI_Barrier into the code "behind the scenes" - i.e., to add an
MPI_Barrier call every N calls to "problem" collectives. The
argument in favor of this was that the hang is caused by consuming memory
due to "unexpected messages", caused principally by the root process in the
collective running slower than other procs. Thus, the notion goes, the root
process continues to fall further and further behind, consuming ever more
memory until it simply cannot progress. Adding the barrier operation forced
the other procs to "hold" until the root process could catch up, thereby
relieving the memory backlog.
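To make that argument concrete, here is a minimal toy model of the backlog hypothesis. It is only a sketch and does not reflect Open MPI internals: the function name peak_backlog, the tick-based timing, and the parameters slow_period and barrier_every are all made up for illustration. One "fast" peer issues one collective call per tick, one "slow" process retires one call every slow_period ticks, and the difference stands in for queued unexpected messages.

```python
def peak_backlog(total_calls, slow_period, barrier_every=0):
    """Peak number of queued 'unexpected' messages in a toy two-process model.

    total_calls   -- number of collective calls the fast peer issues
    slow_period   -- the slow process retires one call every slow_period ticks
    barrier_every -- the fast peer waits for the slow one after this many
                     calls (0 means no barrier is ever inserted)
    """
    issued = 0   # calls issued so far by the fast peer
    retired = 0  # calls completed so far by the slow process
    peak = 0
    tick = 0
    while retired < total_calls:
        tick += 1
        # The fast peer is held at the barrier until the slow one catches up.
        at_barrier = (
            barrier_every > 0
            and issued > 0
            and issued % barrier_every == 0
            and retired < issued
        )
        if issued < total_calls and not at_barrier:
            issued += 1
        # The slow process only makes progress on every slow_period-th tick.
        if tick % slow_period == 0 and retired < issued:
            retired += 1
        peak = max(peak, issued - retired)
    return peak


if __name__ == "__main__":
    print("no barrier:      ", peak_backlog(1000, 2))
    print("barrier every 10:", peak_backlog(1000, 2, barrier_every=10))
```

In this model the backlog without a barrier keeps growing for as long as the loop runs, while a barrier every N calls caps it at a level that scales with N. That is all the hypothesis itself predicts; the model says nothing about where real memory limits lie.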
The sync collective has worked for us, but we are now finding a very
disconcerting behavior - namely, that the precise value of N required to
avoid hanging (a) is very sensitive, so that changing it by even a small
amount can let the app hang again, (b) fluctuates between runs on
an unpredictable basis, and (c) can be different for different collectives.
These new problems surfaced this week when we found that a job that
previously ran fine with one value of coll_sync_barrier_before suddenly hung
when a loop over MPI_Bcast was added to the code. Further investigation has
found that the value of N required to make the new loop work is
significantly different from the prior value that made Allgather work,
forcing an exhaustive search for a "sweet spot" for N.
Clearly, as codes grow in complexity, this simply is not going to work.
It seems to me that we have to begin investigating -why- the 1.3.2 code is
encountering this problem whereas the 1.2.x code is not. From our rough
measurements, there is some speed difference between the two releases, so
perhaps we are now getting fast enough to create the problem - though I don't
think we know enough yet to claim this is true. At this time, we really
don't know -why- one process is running slow, or even if it is -always- the
root process that is doing so...nor have we confirmed (to my knowledge) that
our original analysis of the problem is correct!
We would appreciate any help with this problem. I gathered from today's
telecon that others are also encountering this, so perhaps there is enough
general pain to stimulate a team effort to resolve it!