Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful
From: Mike Hanby (mhanby_at_[hidden])
Date: 2008-10-17 10:25:41


Some further clarification: I read a post on the SGE mailing list
that said the --with-sge option is part of ompi 1.3, not 1.2.x.

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Aleksej Saushev
Sent: Thursday, October 16, 2008 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI portability problems: debug info
isn't helpful

Jeff Squyres <jsquyres_at_[hidden]> writes:

> On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
>
>> $ ompi_info | grep oob
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
>
> Good!
>
>>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
>> [asau.local:09060] mca: base: components_open: Looking for rml
>> components
>> [asau.local:09060] mca: base: components_open: distilling rml
>> components
>> [asau.local:09060] mca: base: components_open: accepting all
>> rml components
>> [asau.local:09060] mca: base: components_open: opening rml components
>> [asau.local:09060] mca: base: components_open: found loaded
>> component oob
>> [asau.local:09060] mca: base: components_open: component oob
>> open function successful
>> [asau.local:09060] orte_rml_base_select: initializing rml
>> component oob
>> [asau.local:09060] orte_rml_base_select: init returned failure
>
> Ah ha -- this is progress. For some reason, your "oob" RML
> plugin is declining to run. I see that its
> query/initialization function is actually quite short:
>
> if(mca_oob_base_init() != ORTE_SUCCESS)
> return NULL;
> *priority = 1;
> return &orte_rml_oob_module;
>
> So it must be failing the mca_oob_base_init() function -- this
> is what initializes the underlying "OOB" (out of band)
> communications subsystem.
>
> Of course, this doesn't fail often, so we don't have any
> run-time switches to enable the debugging output. :-( Edit
> orte/mca/oob/base/oob_base_open.c line 43 and change the value
> of mca_oob_base_output from -1 to 0. Let's see that output --
> I'm particularly interested in the output from querying the tcp
> oob component. I suspect that it's declining to run as well.
>
> I wonder if this is going to end up being an opal_if() issue --
> where we are traversing all the IP network interfaces from the
> kernel... I'll bet even money that it is.

[asau.local:04648] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--------------------------------------------------------------------------
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

Why don't you use strerror(3) to print an explanation of the errno value?

From <sys/errno.h>:

#define	ENXIO		6		/* Device not configured */
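
For illustration, a minimal sketch of the kind of message that question
asks for: the numeric errno together with its strerror(3) text. The
helper name format_errno is hypothetical and is not an Open MPI
function. The wording returned by strerror() differs between platforms,
so no particular output is claimed here.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Open MPI): format a failed call
 * together with the strerror(3) text and the numeric errno value.
 * Returns what snprintf() returns.
 */
static int format_errno(char *buf, size_t len, const char *what, int err)
{
    assert(buf != NULL && what != NULL);
    return snprintf(buf, len, "%s failed: %s (errno=%d)",
                    what, strerror(err), err);
}
```

It could be called as format_errno(buf, sizeof buf,
"ioctl(SIOCGIFFLAGS)", errno) right after the failing call, before
anything else has a chance to overwrite errno.
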

It seems that I have to debug the network interface probing.
How should I use the *_output subroutines so that they actually print?
I tried these changes, but in vain:

--- opal/util/if.c.orig	2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c	2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;
 
+	opal_output(0, "opal_ifinit: checking netif %s", ifr->ifr_name);
+	/* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;
--- opal/util/if.c.orig	2008-08-25 23:16:50.000000000 +0400
+++ opal/util/if.c	2008-10-15 23:55:07.000000000 +0400
@@ -242,6 +242,8 @@
         if(ifr->ifr_addr.sa_family != AF_INET)
             continue;
 
+	fprintf(stderr, "opal_ifinit: checking netif %s\n", ifr->ifr_name);
+	/* HERE IT FAILS!! */
         if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
             opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=%d", errno);
             continue;
--- opal/util/output.c.orig	2008-08-25 23:16:50.000000000 +0400
+++ opal/util/output.c	2008-10-16 19:58:49.000000000 +0400
@@ -41,7 +41,7 @@
 /*
  * Private data
  */
-static int verbose_stream = -1;
+static int verbose_stream = 0;
 static opal_output_stream_t verbose;
 static char *output_dir = NULL;
 static char *output_prefix = NULL;

It seems a bit tricky, and it is scarcely documented.
Have I overlooked something?

What makes it strange is that fprintf(stderr, ...) doesn't work either.

> Specifically: I predict that the tcp oob component is declining
> to run (which then causes the greater OOB init to fail, because
> no OOB components will be able to be found, which then causes
> the RML OOB init to fail, and therefore RML init fails because
> no RML components can be found). My guess is that
> orte/mca/oob/tcp/oob_tcp.c:oob_tcp_component_init() is failing
> to find any valid/UP IP interfaces. It starts traversing the
> list of interfaces at line 864 with the call to opal_ifbegin()
> ("OPAL" is our underlying portability layer). If this was the
> first time opal_ifbegin() was invoked, it'll scan the kernel
> for all the interfaces; otherwise it'll just traverse the list
> that it already has. Either way, you might want to run this
> section through a debugger and see if it's not finding anything.
>
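
The failure chain predicted in the quoted paragraph can be modelled
roughly as below. This is only a sketch of the pattern, not Open MPI
source: every sketch_* name is hypothetical, the selection loop is an
assumption about how a framework picks among components, and the -13
constant simply mirrors the "Not found" / -13 pair in the log earlier
in this message.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS     0
#define SKETCH_NOT_FOUND (-13)  /* mirrors the -13 "Not found" in the log */

typedef struct { const char *name; } sketch_module_t;

/* A component's init returns its module, or NULL to decline. */
typedef sketch_module_t *(*sketch_init_fn)(int *priority);

/* Ask each component in turn; keep the highest-priority volunteer. */
static int sketch_select(const sketch_init_fn *inits, size_t n,
                         sketch_module_t **selected)
{
    int best = -1;

    assert(selected != NULL);
    *selected = NULL;
    for (size_t i = 0; i < n; i++) {
        int prio = 0;
        sketch_module_t *m = inits[i](&prio);
        if (m == NULL)
            continue;               /* component declined to run */
        if (*selected == NULL || prio > best) {
            best = prio;
            *selected = m;
        }
    }
    return (*selected != NULL) ? SKETCH_SUCCESS : SKETCH_NOT_FOUND;
}

/* Stand-in for the number of usable IP interfaces that were found. */
static int sketch_usable_interfaces = 0;

static sketch_module_t sketch_tcp_module = { "tcp" };
static sketch_module_t sketch_oob_module = { "oob" };

/* The tcp OOB component declines when it has no interface to use. */
static sketch_module_t *sketch_oob_tcp_init(int *priority)
{
    if (sketch_usable_interfaces == 0)
        return NULL;
    *priority = 1;
    return &sketch_tcp_module;
}

/* OOB base init fails if no OOB component volunteers. */
static int sketch_oob_base_init(void)
{
    const sketch_init_fn inits[] = { sketch_oob_tcp_init };
    sketch_module_t *m;
    return sketch_select(inits, 1, &m);
}

/* Same shape as the short query function quoted earlier in the thread. */
static sketch_module_t *sketch_rml_oob_init(int *priority)
{
    if (sketch_oob_base_init() != SKETCH_SUCCESS)
        return NULL;
    *priority = 1;
    return &sketch_oob_module;
}

/* RML selection fails in turn, because its only component declined. */
static int sketch_rml_base_select(void)
{
    const sketch_init_fn inits[] = { sketch_rml_oob_init };
    sketch_module_t *m;
    return sketch_select(inits, 1, &m);
}
```

With sketch_usable_interfaces left at 0, sketch_rml_base_select()
returns SKETCH_NOT_FOUND even though the real fault sits two layers
down in interface discovery, which is why the top-level error message
is of so little help.
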
> Just an offhand question: do you have non-localhost IPv4
> interfaces enabled on your machines?

Yes.
ifconfig -l ==> bce0 fwip0 rum0 lo0 pppoe0

>>> That's also odd. I don't see any problems in the source code in
>>> this particular area. What is the output of this area of the
>>> code when compiled with -E? It should show some obvious
>>> problem.
>>
>> I'll check this a bit later, if you don't object.
>
> No problem.

I ran into some difficulties with this today. I'll need time for further
investigation, though I think this isn't needed now.

I'll be unavailable starting Saturday (probably; from Monday for sure).

-- 
HE CE3OH...