Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Pointers for understanding failure messages on NetBSD
From: Kevin.Buckley_at_[hidden]
Date: 2009-12-16 22:41:06


> Just to say that I built the NetBSD OpenMPI 1.4 port from the CVS,
> so including all the recent work, and got the examples to run, albeit
> still with the:
>
> opal_sockaddr2str failed:Unknown error (return code 4)
>
> non-fatal errors.
>
> As promised, I'll do a bit more digging into this.

Here's the result of me "fancying a dig":

The software I was adding on top of OpenMPI, initially PETSc and,
above that, PISM, has exhibited errors when run within an SGE/OpenMPI
environment when FOUR or EIGHT processors are used, but not TWO.

The codes run when 2 or 4 processes are run on a single machine
outside of SGE.

I added a bit of debugging code into the

opal/util/net.c:opal_net_get_hostname()

routine.

--- opal-util-net.c.000 2009-12-17 13:55:18.000000000 +1300
+++ opal-util-net.c 2009-12-17 14:24:08.000000000 +1300
@@ -369,6 +369,10 @@
         return NULL;
     }

+ /* KMB */
+ opal_output(0, "KMB: addr.sa_len %d, addr->sa_family %d, addrlen %d\n",
+ addr->sa_len, addr->sa_family, addrlen ) ;
+ /* KMB */
     error = getnameinfo(addr, addrlen,
                         name, NI_MAXHOST, NULL, 0, NI_NUMERICHOST);

Here's what I see, from stderr, when running the SkaMPI 5 test:

 skampi -i ski/skampi_pt2pt.ski

across a 4-node SGE submission.

The SkaMPI test runs through by the way.

[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Unknown error (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16

You'll notice that at least one "addr" that is making its way into

opal_net_get_hostname

has an sa_len of zero and that that is what seems to be triggering
the

opal_sockaddr2str

messages.

I was wondering whether this was coming out of the IPv6 getifaddr
loop, as I thought I'd set everything explicitly in the munged IPv4
stanza.
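
One way to narrow that down would be a small standalone dump of what
the system itself hands back for each interface address. The sketch
below is only a diagnostic idea, not anything from the Open MPI tree:
sketch_dump_ifaddrs and SKETCH_HAVE_SA_LEN are hypothetical names, and
the platform list is again an assumption. It walks the getifaddrs()
list and prints the family (and sa_len where the member exists), so
any zero-length entries from the OS would show up directly.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <ifaddrs.h>

/* Assumption: these platforms have a BSD-style sa_len member. */
#if defined(__NetBSD__) || defined(__FreeBSD__) || defined(__OpenBSD__) || \
    defined(__DragonFly__) || defined(__APPLE__)
#define SKETCH_HAVE_SA_LEN 1
#endif

/* Hypothetical diagnostic: prints one line per interface address and
 * returns how many addresses were seen, or -1 if getifaddrs() failed. */
static int sketch_dump_ifaddrs(FILE *out)
{
    struct ifaddrs *list = NULL;
    struct ifaddrs *cur;
    int count = 0;

    if (getifaddrs(&list) != 0) {
        return -1;
    }

    for (cur = list; cur != NULL; cur = cur->ifa_next) {
        if (cur->ifa_addr == NULL) {
            continue;   /* some interfaces carry no address */
        }
#ifdef SKETCH_HAVE_SA_LEN
        fprintf(out, "%s: sa_family %d, sa_len %d\n", cur->ifa_name,
                (int)cur->ifa_addr->sa_family, (int)cur->ifa_addr->sa_len);
#else
        fprintf(out, "%s: sa_family %d (no sa_len member here)\n",
                cur->ifa_name, (int)cur->ifa_addr->sa_family);
#endif
        count++;
    }

    freeifaddrs(list);
    return count;
}
```

If every entry printed this way has a sensible sa_len, the zero is
more likely being introduced when the address is copied or rebuilt
later on than by the interface enumeration itself.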

I'd like to "tidy up" those messages, if only because failing with
both an unknown error and a temporary failure doesn't seem right!

Any thoughts welcome,
Kevin

-- 
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand