Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] problem: help-hostfile.txt: Too many open files in system.
From: Ralph Castain (rhc.openmpi_at_[hidden])
Date: 2013-01-10 10:34:22


What is even stranger is that the error occurs when attempting to launch a daemon! Does your program do a series of comm_spawns?
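
For anyone wanting to check the file descriptor leak suggested in the quoted reply below, here is a minimal sketch. It assumes a Linux node with /proc mounted; the helper names count_open_fds and system_file_handles are made up for illustration and are not part of Open MPI. Note that "Error 23" in the ssh messages is ENFILE on Linux, the system-wide limit ("Too many open files in system"), as opposed to the per-process limit EMFILE (24), so it is worth looking at both the process and the whole node.

```python
import os


def count_open_fds(pid="self"):
    """Count the open file descriptors of a process (Linux only).

    Each entry in /proc/<pid>/fd is one open descriptor. The listing
    itself briefly opens one descriptor, so compare readings against
    each other rather than treating the value as exact.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def system_file_handles():
    """Return (allocated, unused, maximum) from /proc/sys/fs/file-nr.

    When 'allocated' approaches 'maximum', new opens anywhere on the
    node start failing with ENFILE.
    """
    with open("/proc/sys/fs/file-nr") as f:
        allocated, unused, maximum = (int(x) for x in f.read().split())
    return allocated, unused, maximum
```

Calling count_open_fds() with the pid of mpirun (or of one of the application processes) once per iteration and logging the result should show whether the count climbs steadily; a value that grows every iteration and never drops points to descriptors that are opened but not closed.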

On Jan 10, 2013, at 7:28 AM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:

> That's a weird one -- it looks like having too many open files on your system is causing a cascading set of failures.
>
> Are you saying that your program runs for a while and then, on iteration 32, it fails with errors like this? If so, I'd look for a file descriptor leak in your program.
>
>
> On Jan 4, 2013, at 12:48 PM, Mariana Vargas Magana <mmarianav_at_[hidden]> wrote:
>
>> Hello Open MPI users:
>>
>> I was running a program that usually works well on the cluster, and suddenly, at iteration 32, I got this strange set of errors. I would appreciate it if someone could give me a hint about the problem and how to solve it.
>>
>> Thanks!
>>
>> Mariana
>>
>>
>> /usr/bin/ssh: error while loading shared libraries: libcrypt.so.1: cannot open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libutil.so.1: cannot open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libfipscheck.so.1: cannot open shared object file: Error 23
>> /usr/bin/ssh: error while loading shared libraries: libkrb5.so.3: cannot open shared object file: Error 23
>> --------------------------------------------------------------------------
>> A daemon (pid 1486) died unexpectedly with status 127 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> Sorry! You were supposed to get help about:
>> no-hostfile
>> But I couldn't open the help file:
>> /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files in system. Sorry!
>> --------------------------------------------------------------------------
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file base/ras_base_allocate.c at line 200
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 99
>> [ferrari:01490] [[65228,0],0] ORTE_ERROR_LOG: Not found in file plm_rsh_module.c at line 1167
>> --------------------------------------------------------------------------
>> Sorry! You were supposed to get help about:
>> no-hostfile
>> But I couldn't open the help file:
>> /home/mvargas/openmpi/share/openmpi/help-hostfile.txt: Too many open files in system. Sorry!
>> --------------------------------------------------------------------------
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file base/ras_base_allocate.c at line 200
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 99
>> [ferrari:01491] [[65229,0],0] ORTE_ERROR_LOG: Not found in file plm_rsh_module.c at line 1167
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
>