On Dec 2, 2009, at 10:24 AM, Nicolas Bock wrote:

On Tue, Dec 1, 2009 at 20:58, Nicolas Bock <nicolasbock@gmail.com> wrote:

On Tue, Dec 1, 2009 at 18:03, Ralph Castain <rhc@open-mpi.org> wrote:
You may want to check your limits as defined by the shell/system. I can also run this for as long as I'm willing to let it run, so something else appears to be going on.

Is that with 1.3.3? I found that with 1.3.4 I can run the example much longer until I hit this error message:

[master] (31996) forking processes
[mujo:14273] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998) of (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0), mkdir failed [1]
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 101
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 425
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file base/ess_base_std_app.c at line 132
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS

After some googling I found that this is apparently an ext3 filesystem limitation, i.e. a single directory can hold at most 31998 subdirectories. Why is Open MPI creating all of these directories in the first place? Is there a way to "recycle" them?
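
For reference, the 31998 figure matches ext3's cap on hard links per inode. The short sketch below works through the arithmetic. The constant 32000 (EXT3_LINK_MAX in the kernel sources) is not taken from this thread, so treat it as an assumption, and the function name is made up for illustration.

```python
# Assumption: ext3 allows at most 32000 hard links per inode (EXT3_LINK_MAX).
EXT3_LINK_MAX = 32000


def max_subdirectories(link_max: int = EXT3_LINK_MAX) -> int:
    """Upper bound on subdirectories of one directory under a link-count cap."""
    # A directory starts with two links: its entry in the parent directory
    # and its own "." entry. Each subdirectory adds one more link through
    # that subdirectory's ".." entry.
    links_used_by_empty_dir = 2
    return link_max - links_used_by_empty_dir


print(max_subdirectories())  # 31998
```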

The session directories are built to house shared memory backing files, plus other potential entries depending upon options. They should be deleted upon finalize of each process, so you shouldn't be running out of them.

I can check to see that the code is cleaning them out (or at least, attempting to do so). I'm not sure whether something about ext3 might retain the directory entries until the "parent" process terminates, even though the files have been deleted.

If you do an ls on the directory tree, do you see 32k subdirectories? Or do you only see the ones for the active processes?
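
One way to answer that question without scrolling through ls output is to count subdirectories programmatically. The following is a minimal sketch, not anything shipped with Open MPI; the function names are hypothetical, and the path in the trailing comment is simply the session directory prefix from the error message above.

```python
import os


def count_subdirs(path):
    """Return the number of immediate subdirectories of path."""
    with os.scandir(path) as entries:
        # Symlinks are not followed, so only real directories are counted.
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def fullest_directories(root, top=5):
    """Walk root and return (count, path) for the directories with the most subdirectories."""
    counts = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        counts.append((count_subdirs(dirpath), dirpath))
    counts.sort(reverse=True)
    return counts[:top]


# Example call, using the session directory prefix from the log:
# for n, d in fullest_directories("/tmp/.private/nbock"):
#     print(n, d)
```

If the largest count stays near the number of active processes, the directories are being cleaned up; if it climbs toward 31998 while the master keeps forking, they are accumulating.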

