Open MPI Development Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-01-19 11:22:02


Rainer,

I was hoping my patch would solve the problem completely ... looks
like that's not the case :( How exactly do you get the deadlock in
the mpi_test_suite? Which configure options? Only
--enable-progress-threads?

   Thanks,
     george.

On Jan 19, 2006, at 11:12 AM, Rainer Keller wrote:

> Hello dear all,
>
> George's patch svn:open-mpi r8741 makes the deadlock that we
> experienced in the mpi_test_suite on a threaded build (compiled
> with --enable-progress-threads) without this patch go away most
> of the time.
>
> Previously, we would hang here:
>
> rusraink_at_pcglap12:~/WORK/OPENMPI/ompi-tests/mpi_test_suite/COMPILE-clean-threads>
> mpirun -np 2 ./mpi_test_suite -r FULL -c MPI_COMM_WORLD -d MPI_INT
>
> P2P tests Ring (3/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
> [... Tests snipped ...]
> P2P tests Alltoall with MPI_Probe (MPI_ANY_SOURCE) (20/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
> Collective tests Bcast (23/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
> ...
> Here we used to always hang.
>
> Now, we get through most of the time (9 out of 10).
> This is all without the below patch.
>
> CU,
> Rainer
>
> On Wednesday 18 January 2006 22:39, Brian Barrett wrote:
>>> Occurrences:
>>> ompi/class/ompi_free_list.h
>>
>> This is ok as is, because the loop protecting against a spurious
>> wakeup is already there. If two threads are waiting, both are woken
>> up, and there's only one request (or somehow, no requests), then
>> they'll try to remove from the list, get NULL, and continue through
>> the bigger while() loop. So that works as expected.
>>
>>> opal/class/opal_free_list.h
>>
>> Same reasoning as ompi_free_list.
>>
>>> ompi/request/req_wait.c /* Two Occurrences: not a
>>> must, but... */
>>
>> I believe these are both correct. The first is in a larger do { ...}
>> while loop that will handle the case of a wakeup with no requests
>> ready. The second is in a tight while() loop already, so we're ok
>> there.
>>
>>> orte/mca/gpr/proxy/gpr_proxy_compound_cmd.c
>>
>> This one I'd like Ralph to look at, because I'm not sure I understand
>> the logic completely. It looks like this is potentially a problem.
>> Only one thread will be woken up at a time, since the mutex has to be
>> re-acquired. So the question becomes, will anyone give up the mutex
>> with component.compound_cmd_mode left set to true, and I think the
>> answer is yes. This looks like it could be a possible bug if people
>> are using the compound command code when multiple threads are
>> active. Thankfully, I don't think this happens very often.
>>
>>> orte/mca/iof/base/iof_base_flush.c:108
>>
>> This looks like it's wrapped in a larger while loop and is safe from
>> any restart wait conditions.
>>
>>> orte/mca/pls/rsh/pls_rsh_module.c:892
>>
>> This could be a bit of a problem, but I don't think spurious wake-ups
>> will cause any real problems. The worst case is that possibly we end
>> up trying to concurrently start more processes than we really
>> intended. But Tim might have more insight than I do.
>>
>>
>> Just my $0.02
>>
>> Brian
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> ---------------------------------------------------------------------
> Dipl.-Inf. Rainer Keller email: keller_at_[hidden]
> High Performance Computing Tel: ++49 (0)711-685 5858
> Center Stuttgart (HLRS) Fax: ++49 (0)711-685 5832
> POSTAL:Nobelstrasse 19 http://www.hlrs.de/people/keller
> ACTUAL:Allmandring 30, R. O.030 AIM:rusraink
> 70550 Stuttgart

"Half of what I say is meaningless; but I say it so that the other
half may reach you"
                                   Kahlil Gibran
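
Note on the free-list reasoning quoted above: the argument that a wait is safe when it sits inside a loop that re-checks the list can be illustrated with a small POSIX threads sketch. This is not the Open MPI code. The names demo_list_t, demo_list_wait, demo_list_return and demo_run are hypothetical and exist only for this illustration. The assumption is a plain mutex plus condition variable guarding a singly linked list, with the producer using a broadcast so that more than one waiter may wake for a single item.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a free list; NOT the Open MPI structure. */
typedef struct item { struct item *next; } item_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    item_t         *head;
} demo_list_t;

/* Caller must hold l->lock. Returns NULL if the list is empty. */
static item_t *demo_remove_first(demo_list_t *l)
{
    item_t *it = l->head;
    if (it != NULL) {
        l->head = it->next;
    }
    return it;
}

/* Block until an item is available. The while loop re-checks the list
 * after every wakeup, so a spurious wakeup, or losing the race to
 * another woken thread, simply leads to another wait. */
static item_t *demo_list_wait(demo_list_t *l)
{
    item_t *it;
    pthread_mutex_lock(&l->lock);
    while ((it = demo_remove_first(l)) == NULL) {
        pthread_cond_wait(&l->cond, &l->lock);
    }
    pthread_mutex_unlock(&l->lock);
    return it;
}

/* Put an item back and wake every waiter; at most one will get it. */
static void demo_list_return(demo_list_t *l, item_t *it)
{
    pthread_mutex_lock(&l->lock);
    it->next = l->head;
    l->head = it;
    pthread_cond_broadcast(&l->cond);
    pthread_mutex_unlock(&l->lock);
}

static void *demo_consumer(void *arg)
{
    return demo_list_wait((demo_list_t *)arg);
}

/* Two consumers, two items. Returns 1 if each consumer ended up with a
 * distinct, non-NULL item. */
int demo_run(void)
{
    demo_list_t l;
    item_t items[2];
    pthread_t t[2];
    void *got[2] = { NULL, NULL };
    int i, rc;

    pthread_mutex_init(&l.lock, NULL);
    pthread_cond_init(&l.cond, NULL);
    l.head = NULL;

    for (i = 0; i < 2; i++) {
        rc = pthread_create(&t[i], NULL, demo_consumer, &l);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < 2; i++) {
        demo_list_return(&l, &items[i]);
    }
    for (i = 0; i < 2; i++) {
        pthread_join(t[i], &got[i]);
    }

    pthread_cond_destroy(&l.cond);
    pthread_mutex_destroy(&l.lock);
    return got[0] != NULL && got[1] != NULL && got[0] != got[1];
}
```

Calling demo_run() from a main() and checking that it returns 1 exercises the case described in the quoted text: both waiters may be woken for a single item, one gets NULL from the list and goes back to waiting. Replacing the while with an if would let that thread return NULL, which is the kind of problem the review of the listed files was looking for.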