
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Pointers for understanding failure messages on NetBSD
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-16 22:57:11


You could confirm that it is the IPv6 loop by simply disabling IPv6 support: configure with --disable-ipv6 and see if you still get the error messages.

Thanks for continuing to pursue this!
Ralph

On Dec 16, 2009, at 8:41 PM, Kevin.Buckley_at_[hidden] wrote:

>> Just to say that I built the NetBSD OpenMPI 1.4 port from the CVS,
>> so including all the recent work, and got the examples to run, albeit
>> still with the:
>>
>> opal_sockaddr2str failed:Unknown error (return code 4)
>>
>> non-fatal errors.
>>
>> As promised, I'll do a bit more digging into this.
>
> Here's the result of me "fancying a dig":
>
> The software I was adding on top of OpenMPI, initially PETSc, and
> above that PISM, has exhibited errors when run within an SGE/OpenMPI
> environment when FOUR or EIGHT processors are used, but not TWO.
>
> The codes run when 2 or 4 processes are run on a single machine
> outside of SGE.
>
>
> I added a bit of debugging code into the
>
> opal/util/net.c:opal_net_get_hostname()
>
> routine.
>
> --- opal-util-net.c.000 2009-12-17 13:55:18.000000000 +1300
> +++ opal-util-net.c 2009-12-17 14:24:08.000000000 +1300
> @@ -369,6 +369,10 @@
> return NULL;
> }
>
> + /* KMB */
> + opal_output(0, "KMB: addr.sa_len %d, addr->sa_family %d, addrlen %d\n",
> + addr->sa_len, addr->sa_family, addrlen ) ;
> + /* KMB */
> error = getnameinfo(addr, addrlen,
> name, NI_MAXHOST, NULL, 0, NI_NUMERICHOST);
>
>
> Here's what I see, from stderr, when running the SkaMPI 5 test:
>
> skampi -i ski/skampi_pt2pt.ski
>
> across a 4-node SGE submission.
>
> The SkaMPI test runs through by the way.
>
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:09293] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [matterhorn.ecs.vuw.ac.nz:09698] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:27796] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [old-bailey.ecs.vuw.ac.nz:27294] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Temporary failure in
> name resolution (return code 4)
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2
> addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2
> addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Temporary
> failure in name resolution (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Temporary failure
> in name resolution (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return
> code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2
> addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error
> (return code 4)
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 0, addr.sa_family 2
> addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] opal_sockaddr2str failed:Temporary
> failure in name resolution (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 0, addr.sa_family 2 addrlen 16
> [khmer.ecs.vuw.ac.nz:14828] opal_sockaddr2str failed:Unknown error (return
> code 4)
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 0, addr.sa_family 2
> addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] opal_sockaddr2str failed:Unknown error
> (return code 4)
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 0, addr.sa_family 2
> addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] opal_sockaddr2str failed:Unknown error
> (return code 4)
> [khmer.ecs.vuw.ac.nz:14828] KMB: addr.sa_len 16, addr.sa_family 2 addrlen 16
> [matterhorn.ecs.vuw.ac.nz:06159] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [kipp-cafe.ecs.vuw.ac.nz:25231] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
> [old-bailey.ecs.vuw.ac.nz:28315] KMB: addr.sa_len 16, addr.sa_family 2
> addrlen 16
>
>
> You'll notice that at least one "addr" that is making its way into
>
> opal_net_get_hostname
>
> has an sa_len of zero and that that is what seems to be triggering
> the
>
> opal_sockaddr2str
>
> messages.
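
A defensive check along the following lines would be one way to keep an address with a zero sa_len from reaching getnameinfo(). This is only a sketch under stated assumptions: checked_numeric_host() and expected_sockaddr_len() are hypothetical names, not Open MPI functions, and HAVE_STRUCT_SOCKADDR_SA_LEN is assumed to be defined by the build system on platforms (such as the BSDs) whose struct sockaddr carries a length member. It has not been tried against the NetBSD port discussed here.

```c
/* Ask for POSIX.1-2001 declarations (getnameinfo, EAI_*) under strict modes. */
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of Open MPI): size a socket address of the
 * given family is expected to have, or 0 for families this sketch ignores.
 */
static socklen_t expected_sockaddr_len(int family)
{
    switch (family) {
    case AF_INET:
        return (socklen_t)sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t)sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}

/*
 * Hypothetical wrapper around getnameinfo(): works on a private copy of the
 * address and, where the platform has a length member, sets it from the
 * address family instead of trusting whatever the caller left there.
 * Returns 0 on success or an EAI_* code on failure.
 */
static int checked_numeric_host(const struct sockaddr *addr, socklen_t addrlen,
                                char *name, socklen_t namelen)
{
    struct sockaddr_storage ss;
    socklen_t want;

    assert(addr != NULL && name != NULL);

    want = expected_sockaddr_len(addr->sa_family);
    if (want == 0 || addrlen < want) {
        /* Unknown family, or caller's buffer too short for that family. */
        return EAI_FAMILY;
    }

    memset(&ss, 0, sizeof(ss));
    memcpy(&ss, addr, want);
#ifdef HAVE_STRUCT_SOCKADDR_SA_LEN   /* assumption: set by the build system */
    ss.ss_len = (unsigned char)want;  /* repair a zero or stale length */
#endif

    return getnameinfo((const struct sockaddr *)&ss, want,
                       name, namelen, NULL, 0, NI_NUMERICHOST);
}
```

A caller would pass the same addr and addrlen it would otherwise hand to getnameinfo(), together with a buffer of at least NI_MAXHOST bytes. Copying the address first keeps the sketch from modifying a structure it does not own.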
>
> I was wondering whether this was coming out of the IPv6 getifaddr
> loop, as I thought I'd set everything explicitly in the munged IPv4
> stanza.
>
> I'd like to "tidy up" those messages, if only because failing with
> both an unknown error and a temporary failure doesn't seem right!
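
In the log above, the same return code 4 appears with two different texts. One possible explanation, which is an assumption and not something confirmed in this thread, is that the text is looked up from a value other than the one printed as the return code (errno, for instance). The sketch below derives both the text and the number from the getnameinfo() return value, so a given code always produces the same line. format_gai_failure() is a hypothetical name, not an Open MPI function; the message layout simply mirrors the lines in the log.

```c
/* Ask for POSIX.1-2001 declarations (gai_strerror, EAI_*) under strict modes. */
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <netdb.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of Open MPI): format a failure line for a
 * getaddrinfo()/getnameinfo() return code. The text comes from
 * gai_strerror(rc), using the same rc that is printed, and errno is
 * deliberately not consulted.
 * Returns what snprintf() returns: the length the full message needs.
 */
static int format_gai_failure(char *buf, size_t buflen,
                              const char *what, int rc)
{
    assert(buf != NULL && what != NULL);
    return snprintf(buf, buflen, "%s failed:%s (return code %d)",
                    what, gai_strerror(rc), rc);
}
```

With a helper of this shape, the caller would pass the value returned by getnameinfo() straight through, for example format_gai_failure(buf, sizeof(buf), "opal_sockaddr2str", error), and hand buf to whatever output routine is in use.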
>
> Any thoughts welcome,
> Kevin
>
> --
> Kevin M. Buckley Room: CO327
> School of Engineering and Phone: +64 4 463 5971
> Computer Science
> Victoria University of Wellington
> New Zealand
>