Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open MPI 1.3 segfault on amd64 with Rmpi
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-01 07:13:02


On Jan 30, 2009, at 4:54 PM, Dirk Eddelbuettel wrote:

> | > where things end in the loop over opal_list() elements. I still see a
> | > fprintf() statement just before
> | >
> | > if (MCA_SUCCESS == component->mca_register_component_params()) {
> | >
> | > in the middle of the open_components function in the file
> | > mca_base_components_open.c
> |
> | Do you know if component is non-NULL and has a sensible value (i.e.,
> | pointing to a valid component)?
>
> Do not. Everything (in particular below /etc/openmpi/) is at default values
> with the sole exception of
>
> # edd 18 Dec 2008
> mca_component_show_load_errors = 0
>
> Could that kill it? [ Goes off and tests... ] No, still dies with segfault
> in open_components.

FWIW: mca_component_show_load_errors should only affect conditional
output of some warning messages.

> | Does ompi_info work? (ompi_info uses this exact same code to find/open
> | components) If ompi_info fails, you should be able to attach a
> | debugger to that, since it's a serial and [relatively] straightforward
> | app.
>
> Yes, ompi_info happily runs and returns around 111 lines. It seems to loop
> over around 25 mca components.
>
> Open MPI is otherwise healthy and happy. It's just that Rmpi does not get
> along with Open MPI 1.3 .... but this happens to be my personal use-case :-/

Quite puzzling. This portion of the code has already successfully
opened the components and is looping over a list of the components
that were found. It *sounds* like that list has somehow gotten
corrupted.

Is there any way you can check that the values of component and
component->mca_register_component_params are non-NULL / valid?
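
One rough way to do that check is to print each pointer before it is dereferenced and stop at the first entry that looks wrong. The sketch below is only an illustration: demo_component_t, demo_register_ok and first_suspect_entry are made-up names, not the real MCA struct or any Open MPI API, and a NULL test cannot catch a pointer that is non-NULL but garbage (a debugger is better for that).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a plugin descriptor; NOT the real MCA struct. */
typedef struct demo_component {
    const char *name;
    int (*register_params)(void);   /* hook called while opening components */
} demo_component_t;

/* Trivial hook used only to populate the sketch. */
static int demo_register_ok(void)
{
    return 0;
}

/*
 * Walk an array of component pointers, log what is found, and return the
 * index of the first entry whose pointer or hook is NULL, or -1 if every
 * entry passes this (very shallow) check.
 */
static int first_suspect_entry(const demo_component_t *const *entries,
                               size_t count)
{
    assert(NULL != entries);
    for (size_t i = 0; i < count; ++i) {
        const demo_component_t *component = entries[i];

        fprintf(stderr, "entry %zu: component=%p\n",
                i, (const void *) component);
        if (NULL == component) {
            return (int) i;
        }

        fprintf(stderr, "entry %zu: register_params is %s\n",
                i, NULL == component->register_params ? "NULL" : "set");
        if (NULL == component->register_params) {
            return (int) i;
        }
    }
    return -1;
}
```

In the real loop the equivalent would be a couple of fprintf() calls on component and on the function pointer just before the failing line, so the last line printed before the segfault shows which list entry is bad.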

FWIW, component should be a pointer to the struct that we use to
represent plugins; it's a member of the list element from the list of
found components. Here's some code from right above the problematic
line:

     for (item = opal_list_get_first(src);
          opal_list_get_end(src) != item;
          item = opal_list_get_next(item)) {
         cli = (mca_base_component_list_item_t *) item;
         component = cli->cli_component;

So you might want to examine cli as well and ensure that it has
sensible values (the casting trick that we do is fairly common in the
OMPI code base -- the list item is the first data member of the
mca_base_component_list_item_t, so we can cast to/from it as required).
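
To make the casting trick concrete, here is a minimal, self-contained sketch of the pattern. All of the type and function names in it (list_item_t, component_t, component_list_item_t, component_from_item, item_from_cli) are simplified, hypothetical stand-ins and not the actual OPAL/MCA definitions; the only point is that a struct whose first member is the list item can be reached from a pointer to that item, and vice versa.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic list node (stand-in for a list item type). */
typedef struct list_item {
    struct list_item *next;
    struct list_item *prev;
} list_item_t;

/* Hypothetical plugin descriptor. */
typedef struct component {
    const char *name;
    int (*register_params)(void);
} component_t;

/* List element that carries a component; the node MUST be the first member. */
typedef struct component_list_item {
    list_item_t super;
    const component_t *component;
} component_list_item_t;

/* The casts below are only valid while `super` sits at offset 0. */
static_assert(offsetof(component_list_item_t, super) == 0,
              "list node must be the first member");

/* Downcast: generic list node -> enclosing element -> its component. */
static const component_t *component_from_item(list_item_t *item)
{
    component_list_item_t *cli = (component_list_item_t *) item;
    return cli->component;
}

/* Upcast: enclosing element -> generic list node (same address). */
static list_item_t *item_from_cli(component_list_item_t *cli)
{
    return &cli->super;
}
```

If the list node were ever moved away from the first position, or the list contained an item that is not really one of these elements, the downcast would read unrelated memory, which is one way a loop like the one above could end up with a bogus component pointer.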

-- 
Jeff Squyres
Cisco Systems