Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-01-12 13:50:35


Which spurious wake-ups ? What is the problem with pthread_cond_wait ?

Anyway, my point is that the patch is not correct. It will lead to
deadlocks in nearly all the situations. Let's just look at the
req_wait.c file. Here is the diff:

Modified: trunk/ompi/request/req_wait.c
===================================================================
--- trunk/ompi/request/req_wait.c 2006-01-12 04:05:02 UTC (rev 8681)
+++ trunk/ompi/request/req_wait.c 2006-01-12 17:13:08 UTC (rev 8682)
@@ -66,7 +66,10 @@
      /* give up and sleep until completion */
      OPAL_THREAD_LOCK(&ompi_request_lock);
      ompi_request_waiting++;
- do {
+ /*
+ * We will break out of while{} as soon as all requests have
completed.
+ */
+ while (1) {
          rptr = requests;
          num_requests_null_inactive = 0;
          for (i = 0; i < count; i++, rptr++) {
@@ -87,10 +90,10 @@
          }
          if(num_requests_null_inactive == count)
              break;
- if (completed < 0) {
+ while (completed < 0) {
              opal_condition_wait(&ompi_request_cond,
&ompi_request_lock);
          }
- } while (completed < 0);
+ }
      ompi_request_waiting--;
      OPAL_THREAD_UNLOCK(&ompi_request_lock);

Imagine one thread goes in the ompi_request_wait_any function. It set
completed to -1 (initialization) then it test the status of the
requests ... and let's suppose they are all actives. completed is
still set to -1 in this case. Then the thread reach the for loop
around the requests starting at line 72. As any of the requests are
completed when it goes out of this loop completed is still set to -1.
Now the thread will reach the
+ while (completed < 0) {
              opal_condition_wait(&ompi_request_cond,
&ompi_request_lock);
As completed is a local variable the only thread that can modify it
is the thread that will be in the opal_condition_wait. As now we look
around the condition as long as the completed is less than zero ...
it look like we will loop there forever ... Basically, the if that
was there before allow the thread to redo the check every time it get
out from the opal_condition_wait, giving a chance to modify the
completed variable.

Similar behavior happens in all the patched places. Tell me if I'm
wrong ... otherwise the patch should be rolled-back ASAP as we are
not thread safe anymore ...

   Thanks,
     george.

On Jan 11, 2006, at 3:05 AM, Rainer Keller wrote:

> Hello dear all,
> I had a point on the tbd-list, that I would like to ask here:
> - Shouldn't we have a while-loop condition around every occurence
> of opal_condition_wait (spurious wake-ups)
> As we may do a pthread_cond_wait,
> e.g. in opal_free_list.h and OPAL_FREE_LIST_WAIT ?
>
> Occurrences:
> ompi/class/ompi_free_list.h
> opal/class/opal_free_list.h
> ompi/request/req_wait.c /* Two Occurences: not a
> must, but... */
> orte/mca/gpr/proxy/gpr_proxy_compound_cmd.c
> orte/mca/iof/base/iof_base_flush.c:108
> orte/mca/pls/rsh/pls_rsh_module.c:892
>
> May I check in the patch attached below?

"Half of what I say is meaningless; but I say it so that the other
half may reach you"
                                   Kahlil Gibran