Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful
From: Aleksej Saushev (asau_at_[hidden])
Date: 2008-10-16 13:39:08


Jeff Squyres <jsquyres_at_[hidden]> writes:

> On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
>
>> $ ompi_info | grep oob
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
>
> Good!
>
>>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
>> [asau.local:09060] mca: base: components_open: Looking for rml
>> components
>> [asau.local:09060] mca: base: components_open: distilling rml
>> components
>> [asau.local:09060] mca: base: components_open: accepting all
>> rml components
>> [asau.local:09060] mca: base: components_open: opening rml components
>> [asau.local:09060] mca: base: components_open: found loaded
>> component oob
>> [asau.local:09060] mca: base: components_open: component oob
>> open function successful
>> [asau.local:09060] orte_rml_base_select: initializing rml
>> component oob
>> [asau.local:09060] orte_rml_base_select: init returned failure
>
> Ah ha -- this is progress. For some reason, your "oob" RML
> plugin is declining to run. I see that its
> query/initialization function is actually quite short:
>
> if(mca_oob_base_init() != ORTE_SUCCESS)
> return NULL;
> *priority = 1;
> return &orte_rml_oob_module;
>
> So it must be failing the mca_oob_base_init() function -- this
> is what initializes the underlying "OOB" (out of band)
> communications subsystem.
>
> Of course, this doesn't fail often, so we don't have any
> run-time switches to enable the debugging output. :-( Edit
> orte/mca/oob/base/oob_base_open.c line 43 and change the value
> of mca_oob_base_output from -1 to 0. Let's see that output --
> I'm particularly interested in the output from querying the tcp
> oob component. I suspect that it's declining to run as well.
>
> I wonder if this is going to end up being an opal_if() issue --
> where we are traversing all the IP network interfaces from the
> kernel... I'll bet even money that it is.
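The chain of failures described above (a component whose init function returns NULL is treated as having declined, and a framework with no remaining components fails) can be illustrated with a minimal sketch. Every type and function name below is a hypothetical, simplified stand-in, not the actual ORTE code; only the "return NULL to decline, otherwise set *priority and return the module" convention is taken from the snippet quoted above.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins; these are NOT the real ORTE types. */
typedef struct module { const char *name; } module_t;

typedef struct component {
    const char *name;
    /* Returns a module and sets *priority, or NULL to decline. */
    module_t *(*init)(int *priority);
} component_t;

module_t tcp_module = { "tcp" };
module_t oob_module = { "oob" };

/* Assumption for the sketch: set to 0 to simulate interface probing
 * that finds nothing usable. */
int usable_interfaces = 0;

module_t *tcp_init(int *priority)
{
    if (usable_interfaces == 0)
        return NULL;                /* decline: no usable interface */
    *priority = 1;
    return &tcp_module;
}

/* Pick the highest-priority component that does not decline. */
module_t *select_module(component_t *c, size_t n)
{
    module_t *best = NULL;
    int best_pri = -1;

    for (size_t i = 0; i < n; i++) {
        int pri = 0;
        module_t *m = c[i].init(&pri);
        if (m == NULL) {
            fprintf(stderr, "component %s declined\n", c[i].name);
            continue;
        }
        if (pri > best_pri) {
            best_pri = pri;
            best = m;
        }
    }
    return best;
}

component_t oob_components[] = { { "tcp", tcp_init } };

/* The "oob" RML component is usable only if some OOB component was selected. */
module_t *rml_oob_init(int *priority)
{
    if (select_module(oob_components, 1) == NULL)
        return NULL;
    *priority = 1;
    return &oob_module;
}

component_t rml_components[] = { { "oob", rml_oob_init } };

/* Returns 0 on success, -1 if every RML component declined. */
int rml_select(void)
{
    return select_module(rml_components, 1) != NULL ? 0 : -1;
}
```

With usable_interfaces set to 0, rml_select() fails even though nothing is wrong in the RML layer itself, which matches the "init returned failure" line in the debug output quoted earlier.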

[asau.local:04648] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Why don't you use strerror(3) to print an explanation of the errno value?

From <sys/errno.h>:
#define ENXIO 6 /* Device not configured */
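As an illustration of the strerror(3) suggestion, here is a minimal sketch of a helper that reports both the numeric errno and its text. The helper name format_errno is hypothetical and is not part of Open MPI.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: describe a failed call using both the errno
 * number and the text that strerror(3) returns for it.
 * Returns whatever snprintf() returns. */
int format_errno(char *buf, size_t len, const char *what, int err)
{
    return snprintf(buf, len, "%s failed: %s (errno=%d)",
                    what, strerror(err), err);
}
```

Since opal_output() is called with a printf-style format in the patch below, the same idea would presumably carry over there by adding a "%s" conversion and passing strerror(errno) alongside errno.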

It seems that I have to debug the network interface probing.
How should I use the *_output subroutines so that they actually print?
I tried the following changes, but in vain:

--- opal/util/if.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c 2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;
 
+        opal_output(0, "opal_ifinit: checking netif %s", ifr->ifr_name);
+        /* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;
--- opal/util/if.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c 2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;
 
+        fprintf(stderr, "opal_ifinit: checking netif %s\n", ifr->ifr_name);
+        /* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;
--- opal/util/output.c.orig 2008-08-25 23:16:50.000000000 +0400
+++ opal/util/output.c 2008-10-16 19:58:49.000000000 +0400
@@ -41,7 +41,7 @@
 /*
  * Private data
  */
-static int verbose_stream = -1;
+static int verbose_stream = 0;
 static opal_output_stream_t verbose;
 static char *output_dir = NULL;
 static char *output_prefix = NULL;

This seems a bit tricky, and it is scarcely documented.
Have I overlooked something?

What makes it strange is that fprintf(stderr, ...) doesn't work either.

> Specifically: I predict that the tcp oob component is declining
> to run (which then causes the greater OOB init to fail, because
> no OOB components will be able to be found, which then causes
> the RML OOB init to fail, and therefore RML init fails because
> no RML components can be found). My guess is that
> orte/mca/oob/tcp/oob_tcp.c:oob_tcp_component_init() is failing
> to find any valid/UP IP interfaces. It starts traversing the
> list of interfaces at line 864 with the call to opal_ifbegin()
> ("OPAL" is our underlying portability layer). If this was the
> first time opal_ifbegin() was invoked, it'll scan the kernel
> for all the interfaces; otherwise it'll just traverse the list
> that it already has. Either way, you might want to run this
> section through a debugger and see if it's not finding anything.
>
> Just an offhand question: do you have non-localhost IPv4
> interfaces enabled on your machines?

Yes.

ifconfig -l ==> bce0 fwip0 rum0 lo0 pppoe0

>>> That's also odd. I don't see any problems in the source code in
>>> this particular area. What is the output of this area of the
>>> code when compiled with -E? It should show some obvious
>>> problem.
>>
>> I'll check this a bit later, if you don't object.
>
> No problem.

I ran into some difficulties with this today and need more time to
investigate, though I think it isn't needed now.

I'll probably be unavailable starting Saturday, and certainly
from Monday.

-- 
HE CE3OH...