Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc: yet another launch failure
From: Nathan Hjelm (hjelmn_at_[hidden])
Date: 2014-01-23 10:35:10


I agree. A configure option to disable the use of getpwuid would be
great, as it is one of those functions that can never be linked
statically. getpwuid also fails for no particular reason on at least
one XC30.

-Nathan

On Wed, Jan 22, 2014 at 08:57:20PM -0800, Ralph Castain wrote:
> Interesting - still, I see no reason for OMPI to fail just because of
> that. We can run just fine with the uid, so I'll make things a little more
> flexible.
> Thanks for tracking it down!
> On Jan 22, 2014, at 7:54 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> Not lacking getpwuid():
> [phh1_at_biou2 BLD]$ grep HAVE_GETPWUID */include/*_config.h
> opal/include/opal_config.h:#define HAVE_GETPWUID 1
> I also can't see why the quoted code could fail.
> The following is working fine:
> [phh1_at_biou2 BLD]$ cat q.c
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <pwd.h>
> int main(void) {
>     uid_t uid = getuid();
>     printf("uid = %d\n", (int)uid);
>     struct passwd *p = getpwuid(uid);
>     if (p) printf("name = %s\n", p->pw_name);
>     return 0;
> }
> [phh1_at_biou2 BLD]$ gcc -std=c99 q.c && ./a.out
> uid = 44154
> name = phh1
> HOWEVER, when built for an ILP32 target (as in the reported failure), the lookup fails:
> [phh1_at_biou2 BLD]$ gcc -m32 -std=c99 q.c && ./a.out
> uid = 44154
> So, I am going to guess that this *is* a system misconfiguration (maybe
> missing the 32-bit foo.so for the appropriate nsswitch resolver?) just
> as the error message said.
> Sorry for the false alarm,
> -Paul
>
> On Wed, Jan 22, 2014 at 7:36 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> Here is the offending code:
> /* get the name of the user */
> uid = getuid();
> #ifdef HAVE_GETPWUID
> pwdent = getpwuid(uid);
> #else
> pwdent = NULL;
> #endif
> if (NULL != pwdent) {
>     user = strdup(pwdent->pw_name);
> } else {
>     orte_show_help("help-orte-runtime.txt",
>                    "orte:session:dir:nopwname", true);
>     return ORTE_ERR_OUT_OF_RESOURCE;
> }
> Is it possible on this platform that you don't have getpwuid? I'm
> surprised at the code, as we could just use the uid instead. I'm not
> sure why this more stringent test was applied.
> On Jan 22, 2014, at 7:02 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> On yet another test platform I see the following:
> $ mpirun -mca btl sm,self -np 1 examples/ring_c
> --------------------------------------------------------------------------
> Open MPI was unable to obtain the username in order to create a path
> for its required temporary directories. This type of error is usually
> caused by a transient failure of network-based authentication services
> (e.g., LDAP or NIS failure due to network congestion), but can also be
> an indication of system misconfiguration.
> Please consult your system administrator about these issues and try
> again.
> --------------------------------------------------------------------------
> [biou2.rice.edu:30021] [[40214,0],0] ORTE_ERROR_LOG: Out of resource in file /home/phh1/SCRATCH/OMPI/openmpi-1.7-latest-linux-ppc32-xlc-11.1/openmpi-1.7.4rc2r30361/orte/util/session_dir.c at line 380
> [biou2.rice.edu:30021] [[40214,0],0] ORTE_ERROR_LOG: Out of resource in file /home/phh1/SCRATCH/OMPI/openmpi-1.7-latest-linux-ppc32-xlc-11.1/openmpi-1.7.4rc2r30361/orte/mca/ess/hnp/ess_hnp_module.c at line 599
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> orte_session_dir failed
> --> Returned value Out of resource (-2) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> An "-np 2" run fails in the same manner.
> This is a production system and there is no problem with "whoami" or
> "id", leaving me doubting the explanation provided by the error
> message.
> [phh1_at_biou2 ~]$ whoami
> phh1
> [phh1_at_biou2 ~]$ id
> uid=44154(phh1) gid=2016(hpc) groups=2016(hpc),3803(hpcusers),3805(sshgw),3808(biou)
> The "ompi_info --all" output is attached.
> Please let me know what additional info is needed.
> -Paul
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> <biou2_info.txt.bz2>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


