> Just to say that I built the NetBSD OpenMPI 1.4 port from the CVS,
> so including all the recent work, and got the examples to run, albeit
> still with the:
>
> opal_sockaddr2str failed:Unknown error (return code 4)
>
> non-fatal errors.
>
> As promised, I'll do a bit more digging into this.
Here's the result of me "fancying a dig":
The software I was adding on top of OpenMPI, initially PETSc, and
above that PISM, has exhibited errors when run within an SGE/OpenMPI
environment when FOUR or EIGHT processors are used, but not TWO.
The codes do run when 2 or 4 processes are run on a single machine
outside of SGE.
I added a bit of debugging code into the
opal/util/net.c:opal_net_get_hostname()
routine.
--- opal-util-net.c.000 2009-12-17 13:55:18.000000000 +1300
+++ opal-util-net.c 2009-12-17 14:24:08.000000000 +1300
@@ -369,6 +369,10 @@
return NULL;
}
+ /* KMB */
+ opal_output(0, "KMB: addr.sa_len %d, addr->sa_family %d, addrlen %d\n",
+ addr->sa_len, addr->sa_family, addrlen ) ;
+ /* KMB */
error = getnameinfo(addr, addrlen,
name, NI_MAXHOST, NULL, 0, NI_NUMERICHOST);
Here's what I see, from stderr, when running the SkaMPI 5 test:
skampi -i ski/skampi_pt2pt.ski
across a 4-node SGE submission.
The SkaMPI test runs through by the way.
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] opal_sockaddr2str failed:Temporary failure in name resolution (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return code 4)
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Unknown error (return code 4)
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error (return code 4)
[khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
[old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
You'll notice that at least one "addr" that is making its way into
opal_net_get_hostname()
has an sa_len of zero, and that this is what seems to be triggering
the
opal_sockaddr2str
messages.
I was wondering whether this was coming out of the IPv6 getifaddrs
loop, as I thought I'd set everything explicitly in the munged IPv4
stanza.
I'd like to "tidy up" those messages, if only because failing with
both an unknown error and a temporary failure doesn't seem right!
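
On the two different strings for the same return code 4: one possible
explanation, unconfirmed here, is that the message text is derived from
something other than the getnameinfo() return value. A minimal sketch
of formatting the message strictly from the return code follows. The
helper name sketch_format_gai_error is hypothetical, and the message
layout is simply copied from the output above.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the diagnostic from the getnameinfo()
 * return code alone, so one code always maps to one string.
 * Returns the snprintf() result.
 */
static int sketch_format_gai_error(int rc, char *buf, size_t buflen)
{
    return snprintf(buf, buflen,
                    "opal_sockaddr2str failed:%s (return code %d)",
                    gai_strerror(rc), rc);
}
```

With this shape, a given return code cannot produce "Unknown error" on
one call and "Temporary failure in name resolution" on the next.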
Any thoughts welcome,
Kevin
--
Kevin M. Buckley Room: CO327
School of Engineering and Phone: +64 4 463 5971
Computer Science
Victoria University of Wellington
New Zealand