Open MPI Development Mailing List Archives

From: Rainer Keller (Keller_at_[hidden])
Date: 2006-01-19 11:12:08

Hello dear all,

George's patch (svn:open-mpi r8741) makes the deadlock that we experienced with
the mpi_test_suite on a threaded build (compiled with --enable-progress-threads)
mostly go away.

Previously, we would hang here:

mpirun -np 2 ./mpi_test_suite -r FULL -c MPI_COMM_WORLD -d MPI_INT

P2P tests Ring (3/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
[... Tests snipped ...]
P2P tests Alltoall with MPI_Probe (MPI_ANY_SOURCE) (20/31), comm
MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
Collective tests Bcast (23/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
Here we used to always hang.

Now we get through most of the time (9 out of 10).
This is all without the below patch.


On Wednesday 18 January 2006 22:39, Brian Barrett wrote:
> > Occurrences:
> > ompi/class/ompi_free_list.h
> This is ok as is, because the loop protecting against a spurious
> wakeup is already there. If two threads are waiting, both are woken
> up, and there's only one request (or somehow, no requests), then
> they'll try to remove from the list, get NULL, and continue through
> the bigger while() loop. So that works as expected.
> > opal/class/opal_free_list.h
> Same reasoning as ompi_free_list.
> > ompi/request/req_wait.c /* Two occurrences: not a
> > must, but... */
> I believe these are both correct. The first is in a larger do { ...}
> while loop that will handle the case of a wakeup with no requests
> ready. The second is in a tight while() loop already, so we're ok
> there.
> > orte/mca/gpr/proxy/gpr_proxy_compound_cmd.c
> This one I'd like Ralph to look at, because I'm not sure I understand
> the logic completely. It looks like this is potentially a problem.
> Only one thread will be woken up at a time, since the mutex has to be
> re-acquired. So the question becomes, will anyone give up the mutex
> with component.compound_cmd_mode left set to true, and I think the
> answer is yes. This looks like it could be a possible bug if people
> are using the compound command code when multiple threads are
> active. Thankfully, I don't think this happens very often.
> > orte/mca/iof/base/iof_base_flush.c:108
> This looks like it's wrapped in a larger while loop and is safe from
> any restart wait conditions.
> > orte/mca/pls/rsh/pls_rsh_module.c:892
> This could be a bit of a problem, but I don't think spurious wake-ups
> will cause any real problems. The worst case is that possibly we end
> up trying to concurrently start more processes than we really
> intended. But Tim might have more insight than I do.
> Just my $0.02
> Brian
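
For reference, the pattern Brian describes for the free lists (re-check the
condition in a loop after every wakeup) can be sketched roughly as below. This
is only an illustration using plain POSIX threads, not the code in the tree:
the names item_t, free_list_t, free_list_init, free_list_return and
free_list_wait are hypothetical and made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list item and free list; not Open MPI's actual types. */
typedef struct item {
    struct item *next;
    int value;
} item_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    item_t         *head;
} free_list_t;

static void free_list_init(free_list_t *fl)
{
    pthread_mutex_init(&fl->lock, NULL);
    pthread_cond_init(&fl->cond, NULL);
    fl->head = NULL;
}

/* Put an item back and wake up every waiter. */
static void free_list_return(free_list_t *fl, item_t *it)
{
    pthread_mutex_lock(&fl->lock);
    it->next = fl->head;
    fl->head = it;
    pthread_cond_broadcast(&fl->cond);
    pthread_mutex_unlock(&fl->lock);
}

/* Block until an item is available. The while() loop is what makes a
 * spurious wakeup (or losing the race to another woken thread) harmless:
 * a thread that finds the list empty simply goes back to waiting. */
static item_t *free_list_wait(free_list_t *fl)
{
    item_t *it;

    pthread_mutex_lock(&fl->lock);
    while (fl->head == NULL) {
        pthread_cond_wait(&fl->cond, &fl->lock);
    }
    it = fl->head;
    fl->head = it->next;
    pthread_mutex_unlock(&fl->lock);

    assert(it != NULL);
    return it;
}
```

If the while() were an if(), two threads woken by the same broadcast could
both fall through, and the second one would dereference an empty list. The
same reasoning applies to waiting on a flag such as the compound command mode
mentioned above: the waiter has to re-test the flag each time it re-acquires
the mutex.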

Dipl.-Inf. Rainer Keller       email: keller_at_[hidden]
  High Performance Computing     Tel: ++49 (0)711-685 5858
    Center Stuttgart (HLRS)        Fax: ++49 (0)711-685 5832
  POSTAL:Nobelstrasse 19   
  ACTUAL:Allmandring 30, R. O.030      AIM:rusraink
  70550 Stuttgart