
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-10-13 08:31:30


On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:

> $ ompi_info | grep oob
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)

Good!

>> Is there a chance that there's some dependent library of oob_rml
>> that is available on your head/build node, but not available on
>> your back-end nodes? (that would be pretty odd, though)
>
> Very unlikely. Unless you don't install it at "make install" time,
> it is there. Host and target are the same (identical).
> Any particular library (set of libraries) to check?

Actually, the output below seems to indicate that the modules are
being *loaded* ok, but they're declining to run for some reason. So I
think we can rule out the dependent libraries issue.

>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
> [asau.local:09060] mca: base: components_open: Looking for rml components
> [asau.local:09060] mca: base: components_open: distilling rml components
> [asau.local:09060] mca: base: components_open: accepting all rml components
> [asau.local:09060] mca: base: components_open: opening rml components
> [asau.local:09060] mca: base: components_open: found loaded component oob
> [asau.local:09060] mca: base: components_open: component oob open function successful
> [asau.local:09060] orte_rml_base_select: initializing rml component oob
> [asau.local:09060] orte_rml_base_select: init returned failure

Ah ha -- this is progress. For some reason, your "oob" RML plugin is
declining to run. I see that its query/initialization function is
actually quite short:

     if(mca_oob_base_init() != ORTE_SUCCESS)
         return NULL;
     *priority = 1;
     return &orte_rml_oob_module;

So it must be failing the mca_oob_base_init() function -- this is what
initializes the underlying "OOB" (out of band) communications subsystem.
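
To make the "init returned failure" line in the log easier to read, here is a toy model of that pattern: a component's init function declines by returning NULL when the layer beneath it is unavailable, and the selection step fails when no component agrees to run. This is only a sketch. Every name in it (toy_module_t, toy_oob_available, toy_rml_oob_init, toy_select) is hypothetical and is not an Open MPI symbol, and the real selection code may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; none are real Open MPI symbols. */
typedef struct { const char *name; } toy_module_t;
typedef toy_module_t *(*toy_init_fn)(int *priority);

static int toy_oob_available = 0;            /* 0 simulates a failing OOB layer */
static toy_module_t toy_rml_oob = { "oob" };

/* Same shape as the snippet above: decline when the layer below fails. */
static toy_module_t *toy_rml_oob_init(int *priority)
{
    if (!toy_oob_available)
        return NULL;
    *priority = 1;
    return &toy_rml_oob;
}

/* Return the highest-priority module that agreed to run, or NULL if
 * every component declined. */
static toy_module_t *toy_select(toy_init_fn *inits, size_t n)
{
    toy_module_t *best = NULL;
    int best_priority = -1;

    assert(inits != NULL);
    for (size_t i = 0; i < n; i++) {
        int priority = 0;
        toy_module_t *m = inits[i](&priority);
        if (m == NULL) {
            fprintf(stderr, "toy_select: init returned failure\n");
            continue;
        }
        if (priority > best_priority) {
            best_priority = priority;
            best = m;
        }
    }
    return best;
}
```

With toy_oob_available left at 0, toy_select() returns NULL even though the component itself was found and opened without trouble, which is the situation the debug output describes.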

Of course, this doesn't fail often, so we don't have any run-time
switches to enable the debugging output. :-( Edit
orte/mca/oob/base/oob_base_open.c line 43 and change the value of
mca_oob_base_output from -1 to 0. Let's see that output -- I'm
particularly interested in the output from querying the tcp oob
component. I suspect that it's declining to run as well.

I wonder if this is going to end up being an opal_if() issue -- where
we are traversing all the IP network interfaces from the kernel...
I'll bet even money that it is.

Specifically: I predict that the tcp oob component is declining to run
(which then causes the greater OOB init to fail, because no OOB
components will be able to be found, which then causes the RML OOB
init to fail, and therefore RML init fails because no RML components
can be found). My guess is that
orte/mca/oob/tcp/oob_tcp.c:oob_tcp_component_init() is failing to
find any valid/UP IP interfaces. It starts traversing the list of
interfaces at line 864
with the call to opal_ifbegin() ("OPAL" is our underlying portability
layer). If this was the first time opal_ifbegin() was invoked, it'll
scan the kernel for all the interfaces; otherwise it'll just traverse
the list that it already has. Either way, you might want to run this
section through a debugger and see if it's not finding anything.

Just an offhand question: do you have non-localhost IPv4 interfaces
enabled on your machines?

>> That's also odd. I don't see any problems in the source code in
>> this particular area. What is the output of this area of the
>> code when compiled with -E? It should show some obvious
>> problem.
>
> I'll check this a bit later, if you don't object.

No problem.

-- 
Jeff Squyres
Cisco Systems