Open MPI Development Mailing List Archives

Subject: [OMPI devel] BTL add procs errors
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2010-05-25 05:10:02


Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL
returns an error during the "add procs" phase.

The current bml/r2 code silently ignores btl->add_procs() error codes, with
the following comment:
---- ompi/mca/bml/r2/bml_r2.c:208 ----
   /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
    * can take care of this task. */
   continue;
--------------------------------------

This seems wrong to me: either a proc is reachable (and the "reachable" bit
field is updated accordingly), or it is not (and nothing is done). Any
error code should denote a fatal error requiring a clean abort.

In the current openib BTL code, the "reachable" bit is set but an error is
returned, which r2 then ignores. The next call to the openib BTL results in
a segmentation fault.
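
To make the three possible outcomes concrete, here is a minimal standalone
C sketch of the policy I have in mind. It is not Open MPI code: every name
in it (sketch_btl_t, sketch_add_procs, the SKETCH_* constants and the three
toy BTLs) is hypothetical, and the "reachable" bit field is modelled as a
plain bool array. It only illustrates the rule "success with bits set,
success with no bits set, or a fatal error".

```c
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not the Open MPI API. */
#define SKETCH_SUCCESS   0
#define SKETCH_ERROR     (-1)
#define SKETCH_MAX_PROCS 8

typedef struct sketch_btl sketch_btl_t;
typedef int (*sketch_add_procs_fn)(sketch_btl_t *btl, size_t nprocs,
                                   bool *reachable);

struct sketch_btl {
    const char *name;
    sketch_add_procs_fn add_procs;
};

/* Case 1: the BTL reaches every proc and reports success. */
static int btl_reaches_all(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = true;
    }
    return SKETCH_SUCCESS;
}

/* Case 2: the BTL reaches nobody; that is not an error, nothing is set. */
static int btl_reaches_none(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl; (void)nprocs; (void)reachable;
    return SKETCH_SUCCESS;
}

/* Case 3: mimics the inconsistent behaviour described above,
 * where a bit is set but an error is returned anyway. */
static int btl_inconsistent(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    if (nprocs > 0) {
        reachable[0] = true;
    }
    return SKETCH_ERROR;
}

/* Proposed policy: any error from add_procs is fatal and is propagated
 * to the caller instead of being skipped with "continue". */
static int sketch_add_procs(sketch_btl_t **btls, size_t nbtls, size_t nprocs,
                            size_t *n_endpoints)
{
    if (nprocs > SKETCH_MAX_PROCS) {
        return SKETCH_ERROR;
    }
    *n_endpoints = 0;

    for (size_t b = 0; b < nbtls; b++) {
        bool reachable[SKETCH_MAX_PROCS] = { false };
        int rc = btls[b]->add_procs(btls[b], nprocs, reachable);
        if (SKETCH_SUCCESS != rc) {
            return rc;   /* previously: continue; */
        }
        /* for each proc that is reachable */
        for (size_t p = 0; p < nprocs; p++) {
            if (reachable[p]) {
                (*n_endpoints)++;
            }
        }
    }
    return SKETCH_SUCCESS;
}
```

With this policy, a list containing btl_reaches_none and btl_reaches_all
still succeeds, since "not reachable" is reported through the bit field
and not through the return code. A list containing btl_inconsistent fails
immediately, so the half-initialized BTL is never used again.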

So, maybe this simple fix would do the trick:
========================================================================
diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
              /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
               * can take care of this task.
               */
-             continue;
+             return rc;
          }

          /* for each proc that is reachable */
========================================================================

Does anyone see a case (with a specific BTL) where add_procs returns an
error but we still want to continue?

Sylvain