Open MPI Development Mailing List Archives

Subject: [OMPI devel] bug in mca framework?
From: Mike Dubman (miked_at_[hidden])
Date: 2013-12-03 07:53:29


Hi,
We observe a crash during shmem_finalize() (in trunk) with the new MCA framework.
After investigation, we found that the MCA tear-down process can access
previously released memory (reproduced with the oshmem_hello_c.c test).

#0  0x00007fffed3d51d0 in ?? ()
#1  <signal handler called>
#2  0x00007ffff710e21e in var_destructor (var=0x6fa7e0) at mca_base_var.c:1605
#3  0x00007ffff710ae99 in opal_obj_run_destructors (object=0x6fa7e0) at ../../../opal/class/opal_object.h:448
#4  0x00007ffff710ca18 in mca_base_var_finalize () at mca_base_var.c:954
#5  0x00007ffff710a7e2 in mca_base_param_finalize () at mca_base_param.c:643
#6  0x00007ffff70e08e2 in opal_finalize_util () at runtime/opal_finalize.c:77
#7  0x00007ffff7aa5319 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:407
#8  0x00007ffff7d900cc in oshmem_shmem_finalize () at runtime/oshmem_shmem_finalize.c:75
#9  0x00007ffff7d91119 in shmem_finalize () at shmem_finalize.c:24
#10 0x00007ffff7d89b8f in __do_global_dtors_aux () from /install/lib/libshmem.so.0
#11 0x0000000000000000 in ?? ()

The crash can be resolved by the following patch:

diff --git a/opal/mca/base/mca_base_var.c b/opal/mca/base/mca_base_var.c
index 9966627..48028d8 100644
--- a/opal/mca/base/mca_base_var.c
+++ b/opal/mca/base/mca_base_var.c
@@ -773,7 +773,7 @@ static int var_find_by_name (const char *full_name, int *index, bool invalidok)

     (void) var_get ((int)(uintptr_t) tmp, &var, false);

-    if (invalidok || VAR_IS_VALID(var[0])) {
+    if (VAR_IS_VALID(var[0])) {
         *index = (int)(uintptr_t) tmp;
         return OPAL_SUCCESS;
     }
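
To make the effect of that one-line change concrete, here is a toy sketch in C. It is not the real mca_base_var code: toy_var_t, toy_vars, TOY_FLAG_VALID and toy_find_by_name are hypothetical stand-ins invented for illustration. It only shows how an "invalidok"-style flag changes which entries a name lookup hands back; it does not claim to explain the actual race.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the real mca_base_var types. */
#define TOY_FLAG_VALID    0x1
#define TOY_SUCCESS       0
#define TOY_ERR_NOT_FOUND (-1)

typedef struct {
    const char *name;
    int flags;
} toy_var_t;

static toy_var_t toy_vars[] = {
    { "alpha", TOY_FLAG_VALID },
    { "beta",  0 },  /* deregistered: in a real system its storage may be gone */
};

/* Look up a variable by name. With invalidok == true, deregistered
 * entries are returned too, so the caller may go on to touch them. */
static int toy_find_by_name(const char *name, int *index, bool invalidok)
{
    size_t n = sizeof(toy_vars) / sizeof(toy_vars[0]);

    assert(name != NULL && index != NULL);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(toy_vars[i].name, name) != 0) {
            continue;
        }
        if (invalidok || (toy_vars[i].flags & TOY_FLAG_VALID)) {
            *index = (int) i;
            return TOY_SUCCESS;
        }
        return TOY_ERR_NOT_FOUND;  /* name matched, but entry is invalid */
    }
    return TOY_ERR_NOT_FOUND;
}
```

In this sketch, looking up "beta" with invalidok set to true succeeds and yields the index of the deregistered entry, while the same lookup with invalidok set to false reports not found. The patch above effectively forces the second behavior for every caller.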

I'm not sure we understand yet why it fixes the problem or what the race is.
Could someone with knowledge of the MCA flows look at it and comment?
The "invalidok" parameter was introduced by Jeff's commit.

Thanks

M