Open MPI Development Mailing List Archives

From: Brian Barrett (brbarret_at_[hidden])
Date: 2006-01-18 16:39:19

On Jan 11, 2006, at 3:05 AM, Rainer Keller wrote:

> Hello dear all,
> I had a point on the tbd-list, that I would like to ask here:
> - Shouldn't we have a while-loop condition around every occurrence
> of opal_condition_wait (spurious wake-ups)?
> As we may do a pthread_cond_wait,
> e.g. in opal_free_list.h and OPAL_FREE_LIST_WAIT ?

I finally got a chance to look at this, and I think for the most part
we're ok. There are two that worry me, but I wanted Ralph and Tim to
weigh in before I did anything. More info below...

> Occurrences:
> ompi/class/ompi_free_list.h

This is ok as is, because the loop protecting against a spurious
wakeup is already there. If two threads are waiting, both are woken
up, and there's only one request (or somehow, no requests), then
they'll try to remove from the list, get NULL, and continue through
the bigger while() loop. So that works as expected.
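
For illustration, here is a minimal sketch of that pattern using plain pthreads. It is not the actual ompi_free_list code: demo_pool_t, demo_pool_take, demo_pool_put and demo_run are hypothetical names, and the "list" is reduced to a counter. The point is only that the wait sits inside a loop that retries the removal, so a thread that wakes up and finds nothing to take simply waits again.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a free list: only a count of available items. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             available;
} demo_pool_t;

static demo_pool_t demo_pool = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
};

/* Block until an item can be removed from the pool. */
static void demo_pool_take(demo_pool_t *p)
{
    pthread_mutex_lock(&p->lock);
    for (;;) {
        if (p->available > 0) {      /* "remove from the list" succeeded */
            p->available--;
            break;
        }
        /* Nothing there (the analogue of getting NULL): wait, then retry.
         * A spurious wake-up, or losing the race to another waiter,
         * just sends us around the loop again. */
        pthread_cond_wait(&p->cond, &p->lock);
    }
    assert(p->available >= 0);       /* the count never goes negative */
    pthread_mutex_unlock(&p->lock);
}

/* Return one item and wake every waiter. */
static void demo_pool_put(demo_pool_t *p)
{
    pthread_mutex_lock(&p->lock);
    p->available++;
    pthread_cond_broadcast(&p->cond);
    pthread_mutex_unlock(&p->lock);
}

static void *demo_waiter(void *arg)
{
    demo_pool_take((demo_pool_t *)arg);
    return NULL;
}

/* Start two waiters, return two items, and report what is left over. */
int demo_run(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, demo_waiter, &demo_pool);
    for (int i = 0; i < 2; i++)
        demo_pool_put(&demo_pool);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return demo_pool.available;
}
```

Each waiter ends up with exactly one item no matter how the wake-ups interleave, because the availability check is repeated after every return from pthread_cond_wait.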

> opal/class/opal_free_list.h

Same reasoning as ompi_free_list.

> ompi/request/req_wait.c /* Two Occurrences: not a
> must, but... */

I believe these are both correct. The first is in a larger do { ... }
while loop that will handle the case of a wakeup with no requests
ready. The second is in a tight while() loop already, so we're ok.

> orte/mca/gpr/proxy/gpr_proxy_compound_cmd.c

This one I'd like Ralph to look at, because I'm not sure I understand
the logic completely. It looks like this is potentially a problem.
Only one thread will be woken up at a time, since the mutex has to be
re-acquired. So the question becomes, will anyone give up the mutex
with component.compound_cmd_mode left set to true, and I think the
answer is yes. This looks like it could be a possible bug if people
are using the compound command code when multiple threads are
active. Thankfully, I don't think this happens very often.
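
To make the concern concrete, here is a rough sketch of a mode flag that stays set after the mutex is released, guarded the way I would expect it to need guarding. It is an assumption-laden illustration, not the gpr proxy code: gate_busy plays the role of compound_cmd_mode, and gate_enter, gate_leave, gate_worker and gate_run are hypothetical names. With a while loop around the wait, a woken thread re-checks the flag before entering the mode; with a bare wait (or an if), a thread could proceed while the flag is still true.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical globals; not the real gpr proxy component structure. */
static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gate_cond = PTHREAD_COND_INITIALIZER;
static int gate_busy       = 0;  /* plays the role of compound_cmd_mode */
static int gate_inside     = 0;  /* threads currently in the mode */
static int gate_max_inside = 0;  /* highest value gate_inside ever reached */

/* Enter the mode, waiting while another thread still holds it. */
static void gate_enter(void)
{
    pthread_mutex_lock(&gate_lock);
    while (gate_busy) {                       /* re-check after every wake-up */
        pthread_cond_wait(&gate_cond, &gate_lock);
    }
    gate_busy = 1;
    gate_inside++;
    if (gate_inside > gate_max_inside)
        gate_max_inside = gate_inside;
    pthread_mutex_unlock(&gate_lock);         /* mutex released, flag stays set */
}

/* Leave the mode and wake one waiter. */
static void gate_leave(void)
{
    pthread_mutex_lock(&gate_lock);
    gate_inside--;
    gate_busy = 0;
    pthread_cond_signal(&gate_cond);
    pthread_mutex_unlock(&gate_lock);
}

static void *gate_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        gate_enter();
        gate_leave();
    }
    return NULL;
}

/* Run four workers and report how many were ever in the mode at once. */
int gate_run(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, gate_worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return gate_max_inside;
}
```

Because the flag is tested in a loop while holding the mutex, at most one thread is ever inside the mode, even if a waiter is woken and another thread has already set the flag again.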

> orte/mca/iof/base/iof_base_flush.c:108

This looks like it's wrapped in a larger while loop, so it is safe
against spurious wake-ups.

> orte/mca/pls/rsh/pls_rsh_module.c:892

This could be a bit of a problem, but I don't think spurious wake-ups
will cause any real problems. The worst case is that possibly we end
up trying to concurrently start more processes than we really
intended. But Tim might have more insight than I do.

Just my $0.02