Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Trunk is broken
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-12-25 16:27:21


FYI: this has been fixed and the temporary patch removed. Turned out to be a problem with progress threads not getting completely cleaned up prior to exit, resulting in multiple threads executing opal_finalize.

On Dec 24, 2012, at 10:43 AM, Ralph Castain <rhc_at_[hidden]> wrote:

> FWIW: I have installed a temporary patch that allows the trunk to run by no longer finalizing OPAL. Once the param system has been repaired, this will be removed. Meantime, at least you can run the trunk.
>
> On Dec 24, 2012, at 10:39 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> Hi folks
>>
>> This is a heads-up to all: It appears a recent commit has broken the trunk - I think it relates to something done to the MCA parameter system. When running across multiple nodes, the daemons segfault on finalize with a stacktrace of:
>>
>> (gdb) where
>> #0 0x0000003dc4477e92 in _int_free () from /lib64/libc.so.6
>> #1 0x00007f18a163f756 in param_destructor (p=0x118d940) at mca_base_param.c:1982
>> #2 0x00007f18a163ab41 in opal_obj_run_destructors (object=0x118d940) at ../../../opal/class/opal_object.h:448
>> #3 0x00007f18a163cb94 in mca_base_param_finalize () at mca_base_param.c:853
>> #4 0x00007f18a1609c06 in opal_finalize_util () at runtime/opal_finalize.c:69
>> #5 0x00007f18a1609cbc in opal_finalize () at runtime/opal_finalize.c:155
>> #6 0x00007f18a18e366b in orte_finalize () at runtime/orte_finalize.c:107
>> #7 0x00007f18a1911313 in orte_daemon (argc=35, argv=0x7ffffd7ea8b8) at orted/orted_main.c:834
>> #8 0x000000000040091a in main (argc=35, argv=0x7ffffd7ea8b8) at orted.c:62
>> (gdb) up
>> #1 0x00007f18a163f756 in param_destructor (p=0x118d940) at mca_base_param.c:1982
>> 1982 free(p->mbp_env_var_name);
>>
>> (gdb) print array[i]
>> $2 = {mbp_super = {obj_magic_id = 0, obj_class = 0x7f18a18c6460, obj_reference_count = 1, cls_init_file_name = 0x7f18a169d04e "mca_base_param.c",
>> cls_init_lineno = 1154}, mbp_type = MCA_BASE_PARAM_TYPE_STRING, mbp_type_name = 0x1185110 "\300O\030\001", mbp_component_name = 0x0,
>> mbp_param_name = 0x1185130 "", mbp_full_name = 0x1185150 "orte_debugger_test_daemon", mbp_synonyms = 0x0, mbp_internal = false,
>> mbp_read_only = false, mbp_deprecated = false, mbp_deprecated_warning_shown = true,
>> mbp_help_msg = 0x11850a0 "Name of the executable to be used to simulate a debugger colaunch (relative or absolute path)",
>> mbp_env_var_name = 0x1185180 "\020P\030\001", mbp_default_value = {intval = 0, stringval = 0x0}, mbp_file_value_set = false, mbp_file_value = {
>> intval = 0, stringval = 0x0}, mbp_source_file = 0x0, mbp_override_value_set = false, mbp_override_value = {intval = 0, stringval = 0x0}}
>>
>> As you can see, the problem is that the mbp_env_var_name field is trash, so the destructor's attempt to free that field crashes.
>>
>> I believe it was Nathan that last touched this area, so perhaps he could take a gander and see what happened? Meantime, I'm afraid the trunk is down.
>>
>> Thanks
>> Ralph
>>
>