Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Comm_spawn lots of times
From: Nicolas Bock (nicolasbock_at_[hidden])
Date: 2009-12-02 16:11:28


On Wed, Dec 2, 2009 at 12:12, Ralph Castain <rhc_at_[hidden]> wrote:

>
> On Dec 2, 2009, at 10:24 AM, Nicolas Bock wrote:
>
>
>
> On Tue, Dec 1, 2009 at 20:58, Nicolas Bock <nicolasbock_at_[hidden]> wrote:
>
>>
>>
>> On Tue, Dec 1, 2009 at 18:03, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> You may want to check your limits as defined by the shell/system. I can
>>> also run this for as long as I'm willing to let it run, so something else
>>> appears to be going on.
>>>
>>>
>>>
>> Is that with 1.3.3? I found that with 1.3.4 I can run the example much
>> longer before I hit this error message:
>>
>>
>> [master] (31996) forking processes
>> [mujo:14273] opal_os_dirpath_create: Error: Unable to create the
>> sub-directory (/tmp/.private/nbock/openmpi-sessions-nbock_at_mujo_0/13386/31998)
>> of (/tmp/.private/nbock/openmpi-sessions-nbock_at_mujo_0/13386/31998/0),
>> mkdir failed [1]
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
>> util/session_dir.c at line 101
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
>> util/session_dir.c at line 425
>> [mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file
>> base/ess_base_std_app.c at line 132
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>> orte_session_dir failed
>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>
>>
> After some googling I found that this is apparently an ext3 filesystem
> limitation, i.e. there can be at most 31998 subdirectories in a single
> directory. Why is Open MPI creating all of these directories in the first
> place? Is there a way to "recycle" them?
>
>
> The session directories are built to house shared memory backing files,
> plus other potential entries depending upon options. They should be deleted
> upon finalize of each process, so you shouldn't be running out of them.
>
> I can check to see that the code is cleaning them out (or at least,
> attempting to do so). Not sure if there is something about ext3 that might
> retain the directory entries until the "parent" process terminates, even
> though the files have been deleted.
>
> If you do an ls on the directory tree, do you see 32k subdirectories? Or do
> you only see the ones for the active processes?
>
That's a good point. While the master process is running I can see the
directory fill up. When I Ctrl-C the master, the directory disappears
completely. But when I let it run all the way to 32K subdirectories, the
directory does not disappear and still contains 32K entries even after the
master gets killed by MPI.

Some process must be keeping a file open in these directories, which would
prevent them from being unlinked, if I understand ext3 correctly.
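For context, the spawn loop under discussion looks roughly like the following. This is a hedged sketch, not Nicolas's original test program; the child executable name "child" and the iteration count are placeholders. The relevant detail is calling MPI_Comm_disconnect on each intercommunicator, which lets Open MPI finalize the child job so its session directory can be removed:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Repeatedly spawn a single child process.  "child" is a
     * placeholder executable name, not from the original thread. */
    for (int i = 0; i < 100000; i++) {
        MPI_Comm child_comm;
        MPI_Comm_spawn("child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &child_comm,
                       MPI_ERRCODES_IGNORE);

        /* Disconnect so the child job can finalize; this is what
         * should allow its session directory to be cleaned up. */
        MPI_Comm_disconnect(&child_comm);
    }

    MPI_Finalize();
    return 0;
}
```

If the directories still accumulate with this pattern, that points at the cleanup path in Open MPI itself (or at ext3 retaining the entries), which is what the thread goes on to suspect.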

nick