
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] 1.7.4rc: yet another launch failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-24 14:18:18


Granted - cmr'd to 1.7.5 with you set to review

On Jan 23, 2014, at 7:35 AM, Nathan Hjelm <hjelmn_at_[hidden]> wrote:

> I agree. A configure option to disable the use of getpwuid would be
> great as it is one of those functions that can never be static. getpwuid
> also fails for no particular reason on at least one XC30.
>
> -Nathan
>
> On Wed, Jan 22, 2014 at 08:57:20PM -0800, Ralph Castain wrote:
>> Interesting - still, I see no reason for OMPI to fail just because of
>> that. We can run just fine with the uid, so I'll make things a little more
>> flexible.
>> Thanks for tracking it down!
>> On Jan 22, 2014, at 7:54 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>> Not lacking getpwuid():
>> [phh1_at_biou2 BLD]$ grep HAVE_GETPWUID */include/*_config.h
>> opal/include/opal_config.h:#define HAVE_GETPWUID 1
>> I also can't see why the quoted code could fail.
>> The following is working fine:
>> [phh1_at_biou2 BLD]$ cat q.c
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <pwd.h>
>> int main(void) {
>>     uid_t uid = getuid();
>>     printf("uid = %d\n", (int)uid);
>>     struct passwd *p = getpwuid(uid);
>>     if (p) printf("name = %s\n", p->pw_name);
>>     return 0;
>> }
>> [phh1_at_biou2 BLD]$ gcc -std=c99 q.c && ./a.out
>> uid = 44154
>> name = phh1
>> HOWEVER, building for ILP32 target (as in the reported failure) fails:
>> [phh1_at_biou2 BLD]$ gcc -m32 -std=c99 q.c && ./a.out
>> uid = 44154
>> So, I am going to guess that this *is* a system misconfiguration (maybe
>> missing the 32-bit foo.so for the appropriate nsswitch resolver?) just
>> as the error message said.
>> Sorry for the false alarm,
>> -Paul
>>
>> On Wed, Jan 22, 2014 at 7:36 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> Here is the offending code:
>> /* get the name of the user */
>> uid = getuid();
>> #ifdef HAVE_GETPWUID
>> pwdent = getpwuid(uid);
>> #else
>> pwdent = NULL;
>> #endif
>> if (NULL != pwdent) {
>>     user = strdup(pwdent->pw_name);
>> } else {
>>     orte_show_help("help-orte-runtime.txt",
>>                    "orte:session:dir:nopwname", true);
>>     return ORTE_ERR_OUT_OF_RESOURCE;
>> }
>> Is it possible on this platform that you don't have getpwuid? I'm
>> surprised at the code as we could just use the uid instead - not sure
>> why this more stringent test was applied
>> On Jan 22, 2014, at 7:02 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>> On yet another test platform I see the following:
>> $ mpirun -mca btl sm,self -np 1 examples/ring_c
>> --------------------------------------------------------------------------
>> Open MPI was unable to obtain the username in order to create a path
>> for its required temporary directories. This type of error is usually
>> caused by a transient failure of network-based authentication services
>> (e.g., LDAP or NIS failure due to network congestion), but can also be
>> an indication of system misconfiguration.
>> Please consult your system administrator about these issues and try
>> again.
>> --------------------------------------------------------------------------
>> [biou2.rice.edu:30021] [[40214,0],0] ORTE_ERROR_LOG: Out of resource in file /home/phh1/SCRATCH/OMPI/openmpi-1.7-latest-linux-ppc32-xlc-11.1/openmpi-1.7.4rc2r30361/orte/util/session_dir.c at line 380
>> [biou2.rice.edu:30021] [[40214,0],0] ORTE_ERROR_LOG: Out of resource in file /home/phh1/SCRATCH/OMPI/openmpi-1.7-latest-linux-ppc32-xlc-11.1/openmpi-1.7.4rc2r30361/orte/mca/ess/hnp/ess_hnp_module.c at line 599
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> orte_session_dir failed
>> --> Returned value Out of resource (-2) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> An "-np 2" run fails in the same manner.
>> This is a production system and there is no problem with "whoami" or
>> "id", leaving me doubting the explanation provided by the error
>> message.
>> [phh1_at_biou2 ~]$ whoami
>> phh1
>> [phh1_at_biou2 ~]$ id
>> uid=44154(phh1) gid=2016(hpc) groups=2016(hpc),3803(hpcusers),3805(sshgw),3808(biou)
>> The "ompi_info --all" output is attached.
>> Please let me know what additional info is needed.
>> -Paul
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> <biou2_info.txt.bz2>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
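
The fallback discussed in this thread (keep going with the numeric uid when getpwuid() returns NULL, rather than aborting) could look roughly like the sketch below. This is not the change that was committed to Open MPI; session_user_name() and uid_to_string() are hypothetical names made up for illustration, and the only system calls assumed are the standard getpwuid(), snprintf() and strerror().

```c
#include <errno.h>
#include <pwd.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>   /* getuid(), for callers of the helpers below */

/* Hypothetical helper (not an Open MPI function): write the uid as
 * decimal text into out. Returns 0 on success, -1 if out is too small. */
int uid_to_string(uid_t uid, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%lu", (unsigned long)uid);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}

/* Hypothetical helper (not an Open MPI function): produce a string that
 * identifies the user. Prefers the login name from getpwuid(); if the
 * lookup yields nothing, falls back to the decimal uid instead of
 * treating the failure as fatal. Returns 0 on success, -1 if out is
 * too small. */
int session_user_name(uid_t uid, char *out, size_t outlen)
{
    struct passwd *pwdent;
    int n;

    errno = 0;                      /* lets us tell "error" from "no entry" */
    pwdent = getpwuid(uid);
    if (NULL != pwdent && NULL != pwdent->pw_name) {
        n = snprintf(out, outlen, "%s", pwdent->pw_name);
        return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
    }

    if (0 != errno) {
        /* Report why the lookup failed, but do not give up. */
        fprintf(stderr, "getpwuid(%lu): %s\n",
                (unsigned long)uid, strerror(errno));
    }
    return uid_to_string(uid, out, outlen);
}
```

Called as session_user_name(getuid(), buf, sizeof(buf)), this would give the login name where the lookup works and the decimal uid where getpwuid() returns NULL, as it did in the -m32 build reported above. Clearing errno before the call is the usual way to tell a real lookup error from a plain missing entry.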