Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Program hangs in mpi_bcast
From: Alex A. Granovsky (gran_at_[hidden])
Date: 2011-12-09 09:40:23

Dear Jeff,

thanks so much for your detailed reply and explanations and sorry for not answering sooner.

I'll try to develop a reproducer, and I have some ideas about how this can be done.
At least I know the typical scenarios that cause this issue to appear. To be honest, I'm rather
busy these days (as probably most of us are), but I'll try to do this as soon as I can.

Just a brief comment on repeated collectives. I know of at least two situations in which repeated
collectives are either required or beneficial. First, the arrays to be (all)reduced can be large
enough that the element count overflows a 32-bit integer, so one has to split a single operation
into a sequence of calls. I know that some MPI implementations support 64-bit integers as
arguments for an extended set of functions handling large arrays, but some do not. In addition,
such splitting reduces the probability of hangs due to lack of resources on the compute nodes.
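
A minimal sketch of the splitting described above, written in Python purely for illustration. The names `chunk_ranges`, `chunked_allreduce`, and the `allreduce` callable are hypothetical stand-ins and not part of any MPI library; in a real code the callable would wrap the actual (all)reduce call on a slice of the buffer.

```python
INT32_MAX = 2**31 - 1  # largest element count a signed 32-bit count argument can hold


def chunk_ranges(total_count, max_chunk=INT32_MAX):
    """Yield (offset, count) pairs that cover [0, total_count) with count <= max_chunk."""
    if total_count < 0:
        raise ValueError("total_count must be non-negative")
    if not 0 < max_chunk <= INT32_MAX:
        raise ValueError("max_chunk must be between 1 and INT32_MAX")
    offset = 0
    while offset < total_count:
        count = min(max_chunk, total_count - offset)
        yield offset, count
        offset += count


def chunked_allreduce(allreduce, sendbuf, recvbuf, max_chunk=INT32_MAX):
    """Reduce sendbuf into recvbuf one slice at a time.

    `allreduce(send_slice)` is a hypothetical callable standing in for the real
    collective; it must return the reduced values for that slice.
    """
    for offset, count in chunk_ranges(len(sendbuf), max_chunk):
        recvbuf[offset:offset + count] = allreduce(sendbuf[offset:offset + count])
```

Every rank has to use the same `max_chunk`, so that all ranks issue the same sequence of collective calls with matching counts.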

Second, our experience with every transport, MPI implementation, and CPU type
we have tried so far shows that the overall performance of (all)reduce is usually worse on a very
large array than on a sequence of calls over smaller chunks. While it is hard to predict the optimal
chunk size, it can easily be found experimentally.
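
One way to find the chunk size experimentally is a simple timing sweep. The sketch below is an assumption about how such a sweep might look, not a description of what Firefly does; `op(offset, count)` is a hypothetical callable that performs the reduction on one chunk, and `time_chunked` and `best_chunk_size` are illustrative names.

```python
import time


def time_chunked(op, total_count, chunk, repeats=3):
    """Return the best wall-clock time over `repeats` passes in pieces of `chunk` elements."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for offset in range(0, total_count, chunk):
            op(offset, min(chunk, total_count - offset))
        best = min(best, time.perf_counter() - start)
    return best


def best_chunk_size(op, total_count, candidates, repeats=3):
    """Time every candidate chunk size; return (fastest_chunk, all_timings)."""
    timings = {chunk: time_chunked(op, total_count, chunk, repeats) for chunk in candidates}
    return min(timings, key=timings.get), timings
```

In a parallel run, each rank would measure its own times, so the ranks would still need to agree on a single chunk size (for example, by taking the slowest rank's time for each candidate) before using it.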

> > Some of our users would like to use Firefly with OpenMPI. Usually, we
> > simply answer them that OpenMPI is too buggy to be used.

> This surprises me. Is this with regards to this collective/hang issue, or something else?

Yes, this is with regard to the collective hang issue.

All the best,

----- Original Message -----
From: "Jeff Squyres" <jsquyres_at_[hidden]>
To: "Alex A. Granovsky" <gran_at_[hidden]>;
Sent: Saturday, December 03, 2011 3:36 PM
Subject: Re: [OMPI users] Program hangs in mpi_bcast

On Dec 2, 2011, at 8:50 AM, Alex A. Granovsky wrote:

> I would like to start a discussion on the implementation of collective
> operations within OpenMPI. The reason for this is at least twofold.
> In recent months, there has been a constantly growing number of messages on
> the list sent by people facing problems with collectives, so I do
> believe these issues must be discussed and hopefully will finally
> attract the proper attention of OpenMPI developers. The second one is my
> involvement in the development of Firefly Quantum Chemistry package,
> which, of course, uses collectives rather intensively.

Greetings Alex, and thanks for your note. We take it quite seriously, and had a bunch of phone/off-list conversations about it in
the past 24 hours.

Let me shed a little light on the history with regards to this particular issue...

- This issue was originally brought to light by LANL quite some time ago. They discovered that one of their MPI codes was hanging
in the middle of lengthy runs. After some investigation, it was determined that it was hanging in the middle of some collective
operations -- MPI_REDUCE, IIRC (maybe MPI_ALLREDUCE? For the purposes of this email, I'll assume MPI_REDUCE).

- It turns out that this application called MPI_REDUCE a *lot*. Which is not uncommon. However, it was actually a fairly poorly
architected application, such that it was doing things like repeatedly invoking MPI_REDUCE on single variables rather than bundling
them up into an array and computing them all with a single MPI_REDUCE (for example). Calling MPI_REDUCE a lot is not necessarily a
problem, per se, however -- MPI guarantees that this is supposed to be ok. I'll bring up below why I mention this specific point.

- After some investigating at LANL, they determined that putting a barrier in every N iterations caused the hangs to stop. A little
experimentation determined that running a barrier every 1000 collective operations both did not affect performance in any noticeable
way and avoided whatever the underlying problem was.

- The user did not want to add the barriers to their code, so we added another collective module that internally counts collective
operations and invokes a barrier every N iterations (where N is settable via MCA parameter). We defaulted N to 1000 because it
solved LANL's problems. I do not recall offhand whether we experimented to see if we could make N *more* than 1000 or not.

- Compounding the difficulty of this investigation was the fact that other Open MPI community members had an incredibly difficult
time reproducing the problem. I don't think that I was able to reproduce the problem at all, for example. I just took Ralph's old
reproducers and tried again, and am unable to make OMPI 1.4 or OMPI 1.5 hang. I actually modified his reproducers to make them a
bit *more* abusive (i.e., flood rank 0 with even *more* unexpected incoming messages), but I still can't get it to hang.

- To be clear: due to personnel changes at LANL at the time, there was very little experience in the MPI layer at LANL (Ralph, who
was at LANL at the time, is the ORTE guy -- he actively stays out of the MPI layer whenever possible). The application that
generated the problem was on restricted / un-shareable networks, so no one else in the OMPI community could see them. So:

  - no one else could replicate the problem
  - no OMPI layer expert could see the application that caused the problem

This made it *extremely* difficult to diagnose. As such, the barrier-every-N-iterations solution was deemed sufficient.

- There were some *suppositions* about what the real problem was, but we were never able to investigate properly, due to the
conditions listed above. The suppositions included:

  - some kind of race condition where an incoming message is dropped. This seemed unlikely, however, because if we were dropping
messages, that kind of problem should have shown up long ago
  - resource exhaustion. There are 3 documented issues with Open MPI running out of registered memory (one of which is just about
to get fixed). See: (this one is about to be fixed)

    It *could* be an issue with running out of registered memory, but preliminary investigation indicated that it *might* not have
been. However, this investigation was hampered by the factors above, and therefore was not completed (and therefore was not

FWIW, LANL now does have additional OMPI-level experts on staff, but the one problematic application that showed this behavior has
been re-written/modernized and no longer exhibits the problem. Hence, no one can justify reviving the old, complex, legacy code to
figure out what, if any, was the actual problem.

- Since no one else was able to replicate the problem, we determined that the barrier-every-N-iterations solution was sufficient.
We added the sync module to OMPI v1.4 and v1.5, and made it the default. It solved LANL's problems and didn't affect performance in
a noticeable way: problem solved, let's move on to the next issue.

- The most recent report about this issue had the user claim that they had to set the iteration count down to *5* (vs. 1000) before
their code worked right. This did set off alarm bells in my head -- *5* is waaaay too small of a number. That's why I specifically
asked if there was a way we could get a reproducer for that issue -- it would (hopefully) be a smoking gun pointing to whatever the
actual underlying issue was. Unfortunately, the user had a good enough solution and moved on, so a reproducer wasn't possible with
available resources. That being said, given that the number the user had to use was *5*, I wonder if there is some other problem /
race condition in the application itself. Keep in mind that just because an application runs with one MPI implementation doesn't
mean that it is correct / conformant. But without a detailed analysis of the problematic application code, it's impossible to say.

- Per the "the original LANL code was poorly architected" comment above, it falls into this same category: we don't actually know if
the application itself was correct. Since there were no MPI experts available at LANL at the time, the MPI application itself was
not analyzed to see if it, itself, was correct. To be clear: it is *possible* that OMPI is correct in hanging because the
application itself is invalid. That sounds like me avoiding responsibility, but it is a possibility that cannot be ignored. We've
run into *lots* of faulty user applications that, once corrected, run just fine. But that being said, we don't *know* that the
application was faulty (nor did we assume it) because a proper analysis could not be done either on that code or on what was
happening inside OMPI. So we don't know where the problem was.
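
To illustrate the "poorly architected" pattern mentioned in the history above (one reduction per variable instead of one reduction over a bundled array), here is a small Python sketch. It simulates the ranks locally; `elementwise_sum` is a hypothetical stand-in for a sum reduction, and none of these names come from the LANL application or from Open MPI.

```python
def elementwise_sum(buffers):
    """Stand-in for a sum reduction: combine one equal-length buffer per rank."""
    return [sum(column) for column in zip(*buffers)]


def reduce_one_by_one(per_rank_values):
    """One reduction per variable. Returns (results, number_of_collective_calls)."""
    n_vars = len(per_rank_values[0])
    results = []
    for i in range(n_vars):
        # Each variable travels in its own single-element buffer.
        results.extend(elementwise_sum([[values[i]] for values in per_rank_values]))
    return results, n_vars


def reduce_bundled(per_rank_values):
    """Pack all variables into one buffer and reduce once."""
    return elementwise_sum(per_rank_values), 1
```

Both functions produce the same sums; the bundled version simply issues one collective call instead of one per variable.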

So -- that's why we are where we are today. Basically: a) this issue seemed to happen to a *very* small number of users, and b) no
one has created a reproducer that MPI experts can use to reliably diagnose the actual problem.

My only point in this lengthy recitation of history: there are (good) reasons why we are where we are.

All that being said, however, if a) and/or b) are incorrect -- e.g., if you have a simple reproducer code that can exhibit the
problem -- that would be *great*. I'd also apologize, because we honestly thought this was a problem that had affected a very small
number of people and that the coll sync workaround fixed the issue for everyone in an un-noticeable way.
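
For readers who want to picture the barrier-every-N-operations idea at the application level, here is a minimal Python sketch. It is not the coll sync module's code; the class name `SyncEveryN`, the `barrier` callable, and the choice to run the barrier *after* the N-th call are all assumptions made for illustration.

```python
class SyncEveryN:
    """Wrap collective calls and run a barrier after every n-th one."""

    def __init__(self, barrier, n=1000):
        if n < 1:
            raise ValueError("n must be at least 1")
        self._barrier = barrier  # hypothetical callable that performs the barrier
        self._n = n
        self._count = 0

    def call(self, collective, *args, **kwargs):
        """Run one collective, then a barrier if this was the n-th call since the last one."""
        result = collective(*args, **kwargs)
        self._count += 1
        if self._count == self._n:
            self._barrier()
            self._count = 0
        return result
```

This only works if every rank routes the same collectives through the wrapper in the same order, so that all counters reach N at the same point.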

> Some of our users would like to use Firefly with OpenMPI. Usually, we
> simply answer them that OpenMPI is too buggy to be used.

This surprises me. Is this with regards to this collective/hang issue, or something else? I don't see prior emails from you
indicating any specific bugs -- did I miss them? It would be good to get whatever the issues are fixed.

Do you have some specific issues that you could report to us?

More specifically, do you have a simple reproducer that shows the collective hangs when the coll sync module is disabled? That
would be most helpful.

If you're still reading this super lengthy email :-), many thanks for your time for a) reporting the issue, and b) reading my huge

Jeff Squyres