Open MPI Development Mailing List Archives

Subject: [OMPI devel] BTL add procs errors
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2010-05-25 05:10:02


Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL
returns an error during the "add procs" phase.

The current bml/r2 code silently ignores btl->add_procs() error codes, with
the following comment:
---- ompi/mca/bml/r2/bml_r2.c:208 ----
   /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
    * can take care of this task. */
   continue;
--------------------------------------

This seems wrong to me: either a proc is reachable (and the "reachable" bit
field is updated accordingly), or it is not (and nothing is done). Any
error code should denote a fatal error that requires a clean abort.

In the current openib BTL code, the "reachable" bit is set but an error is
returned, which r2 then ignores. The next call to the openib BTL results in
a segmentation fault.

So, maybe this simple fix would do the trick:
========================================================================
diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
              /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
               * can take care of this task.
               */
-             continue;
+             return rc;
          }

          /* for each proc that is reachable */
========================================================================

Does anyone see a case (with a specific BTL) where add_procs returns an
error but we still want to continue?

Sylvain