On Dec 2, 2011, at 8:50 AM, Alex A. Granovsky wrote:
> I would like to start discussion on implementation of collective
> operations within OpenMPI. The reason for this is at least twofold.
> Last months, there was the constantly growing number of messages in
> the list sent by persons facing problems with collectives so I do
> believe these issues must be discussed and hopefully will finally
> attract proper attention of OpenMPI developers. The second one is my
> involvement in the development of Firefly Quantum Chemistry package,
> which, of course, uses collectives rather intensively.
Greetings Alex, and thanks for your note. We take it quite seriously, and had a bunch of phone/off-list conversations about it in the past 24 hours.
Let me shed a little light on the history of this particular issue...
- This issue was originally brought to light by LANL quite some time ago. They discovered that one of their MPI codes was hanging in the middle of lengthy runs. After some investigation, it was determined that it was hanging in the middle of some collective operations -- MPI_REDUCE, IIRC (maybe MPI_ALLREDUCE? For the purposes of this email, I'll assume MPI_REDUCE).
- It turns out that this application called MPI_REDUCE a *lot*. Which is not uncommon. However, it was actually a fairly poorly architected application: it was doing things like repeatedly invoking MPI_REDUCE on single variables rather than bundling them up into an array and computing them all with a single MPI_REDUCE (for example). Calling MPI_REDUCE a lot is not a problem per se -- MPI guarantees that this is supposed to be OK. I'll explain below why I mention this specific point.
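To illustrate the bundling point, here is a minimal sketch of the two patterns (the variable names are invented for illustration; this is not the LANL application's actual code):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    double energy = 1.0, dipole = 2.0, charge = 3.0;  /* local partial results */
    double sum_e, sum_d, sum_c;

    MPI_Init(&argc, &argv);

    /* The stressful pattern: one MPI_REDUCE per scalar, repeated many times. */
    MPI_Reduce(&energy, &sum_e, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&dipole, &sum_d, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&charge, &sum_c, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* The equivalent bundled form: one collective covering all three values. */
    double in[3] = { energy, dipole, charge };
    double out[3];
    MPI_Reduce(in, out, 3, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

Both forms are legal MPI; the bundled one simply issues one collective instead of three.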
- After some investigating at LANL, they determined that putting in a barrier every N iterations made the hangs stop. A little experimentation determined that running a barrier every 1000 collective operations avoided whatever the underlying problem was without affecting performance in any noticeable way.
- The user did not want to add the barriers to their code, so we added another collective module that internally counts collective operations and invokes a barrier every N iterations (where N is settable via MCA parameter). We defaulted N to 1000 because it solved LANL's problems. I do not recall offhand whether we experimented to see if we could make N *more* than 1000 or not.
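For reference, that module's behavior can be adjusted (or the module disabled) from the command line. I'm writing the parameter name from memory, so please double-check it against your install with ompi_info:

```shell
# Insert a barrier every 500 collective operations instead of the default 1000
# (parameter name from memory -- verify with: ompi_info --param coll sync)
mpirun --mca coll_sync_barrier_before 500 -np 64 ./my_app

# Or disable the sync module entirely
mpirun --mca coll ^sync -np 64 ./my_app
```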
- Compounding the difficulty of this investigation was the fact that other Open MPI community members had an incredibly difficult time reproducing the problem. I don't think that I was able to reproduce the problem at all, for example. I just took Ralph's old reproducers and tried again, and am unable to make OMPI 1.4 or OMPI 1.5 hang. I actually modified his reproducers to make them a bit *more* abusive (i.e., flood rank 0 with even *more* unexpected incoming messages), but I still can't get it to hang.
- To be clear: due to personnel changes at LANL at the time, there was very little experience in the MPI layer at LANL (Ralph, who was at LANL at the time, is the ORTE guy -- he actively stays out of the MPI layer whenever possible). The application that generated the problem was on restricted / un-shareable networks, so no one else in the OMPI community could see it. So:
- no one else could replicate the problem
- no OMPI layer expert could see the application that caused the problem
This made it *extremely* difficult to diagnose. As such, the barrier-every-N-iterations solution was deemed sufficient.
- There were some *suppositions* about what the real problem was, but we were never able to investigate properly, due to the conditions listed above. The suppositions included:
- some kind of race condition where an incoming message is dropped. This seemed unlikely, however, because if we were dropping messages, that kind of problem should have shown up long ago
- resource exhaustion. There are three documented issues with Open MPI running out of registered memory, one of which is just about to get fixed:
https://svn.open-mpi.org/trac/ompi/ticket/2157 (this is the one about to be fixed)
It *could* be an issue with running out of registered memory, but preliminary investigation indicated that it *might* not have been. However, this investigation was hampered by the factors above, and therefore was not completed (and therefore was not definitive).
FWIW, LANL now does have additional OMPI-level experts on staff, but the one problematic application that showed this behavior has been re-written/modernized and no longer exhibits the problem. Hence, no one can justify reviving the old, complex, legacy code to figure out what, if anything, the actual problem was.
- Since no one else was able to replicate the problem, we determined that the barrier-every-N-iterations solution was sufficient. We added the sync module to OMPI v1.4 and v1.5, and made it the default. It solved LANL's problems and didn't affect performance in a noticeable way: problem solved, let's move on to the next issue.
- The most recent report about this issue had the user claim that they had to set the iteration count down to *5* (vs. 1000) before their code worked right. This did set off alarm bells in my head -- *5* is waaaay too small of a number. That's why I specifically asked if there was a way we could get a reproducer for that issue -- it would (hopefully) be a smoking gun pointing to whatever the actual underlying issue was. Unfortunately, the user had a good enough solution and moved on, so a reproducer wasn't possible with available resources. That being said, given that the number the user had to use was *5*, I wonder if there is some other problem / race condition in the application itself. Keep in mind that just because an application runs with one MPI implementation doesn't mean that it is correct / conformant. But without a detailed analysis of the problematic application code, it's impossible to say.
- Per the "the original LANL code was poorly architected" comment above, it falls into this same category: we don't actually know if the application itself was correct. Since there were no MPI experts available at LANL at the time, the MPI application itself was never analyzed to see if it, itself, was correct. To be clear: it is *possible* that OMPI is correct in hanging because the application itself is invalid. That sounds like me avoiding responsibility, but it is a possibility that cannot be ignored. We've run into *lots* of faulty user applications that, once corrected, run just fine. That being said, we don't *know* that the application was faulty (nor did we assume it), because a proper analysis could not be done either on that code or on what was happening inside OMPI. So we don't know where the problem was.
So -- that's why we are where we are today. Basically: a) this issue seemed to happen to a *very* small number of users, and b) no one has created a reproducer that MPI experts can use to reliably diagnose the actual problem.
My only point in this lengthy recitation of history: there are (good) reasons why we are where we are.
All that being said, however, if a) and/or b) are incorrect -- e.g., if you have a simple reproducer code that can exhibit the problem -- that would be *great*. I'd also owe you an apology, because we honestly thought this was a problem that affected a very small number of people and that the coll sync workaround fixed the issue for everyone in an unnoticeable way.
> Some of our users would like to use Firefly with OpenMPI. Usually, we
> simply answer them that OpenMPI is too buggy to be used.
This surprises me. Is this with regards to this collective/hang issue, or something else? I don't see prior emails from you indicating any specific bugs -- did I miss them? It would be good to get whatever the issues are fixed.
Do you have some specific issues that you could report to us?
More specifically, do you have a simple reproducer that shows the collective hangs when the coll sync module is disabled? That would be most helpful.
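To be concrete, even something as small as the following skeleton would be enormously helpful (the loop count is an invented placeholder -- adjust it to match the pattern that hangs for you), run with the sync module turned off:

```c
#include <mpi.h>

/* Hammer the root with many small reductions -- the pattern that
 * historically triggered the hangs.
 * Run with, e.g.:  mpirun --mca coll ^sync -np 64 ./reproducer
 */
int main(int argc, char **argv)
{
    double in = 1.0, out;

    MPI_Init(&argc, &argv);
    for (long i = 0; i < 1000000; ++i) {
        MPI_Reduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```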
If you're still reading this super lengthy email :-), many thanks for your time for a) reporting the issue, and b) reading my huge reply!