Hmm... if you are willing to keep trying, could you perhaps let it run for a brief time, Ctrl-Z it, and then do an ls on the directory of a process that has already terminated? The pids will be in order, so just look for an early number (not mpirun or the parent, of course).
It would help if you could send us the contents of a directory from a child process that has terminated - that would tell us which subsystem is failing to clean up properly.
Thanks - and sorry for the problem.
On Dec 2, 2009, at 2:11 PM, Nicolas Bock wrote:
On Wed, Dec 2, 2009 at 12:12, Ralph Castain <email@example.com> wrote:
On Dec 2, 2009, at 10:24 AM, Nicolas Bock wrote:
On Tue, Dec 1, 2009 at 20:58, Nicolas Bock <firstname.lastname@example.org> wrote:
On Tue, Dec 1, 2009 at 18:03, Ralph Castain <email@example.com> wrote:
You may want to check your limits as defined by the shell/system. I can also run this for as long as I'm willing to let it run, so something else appears to be going on.
Is that with 1.3.3? I found that with 1.3.4 I can run the example much longer before I hit this error message:
[master] (31996) forking processes
[mujo:14273] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998) of (/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/13386/31998/0), mkdir failed 
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 101
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 425
[mujo:14273] [[13386,31998],0] ORTE_ERROR_LOG: Error in file base/ess_base_std_app.c at line 132
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
--> Returned value Error (-1) instead of ORTE_SUCCESS
After some googling, I found that this is apparently an ext3 filesystem limitation, i.e. a directory can contain at most 31998 subdirectories. Why is Open MPI creating all of these directories in the first place? Is there a way to "recycle" them?
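The limit is easy to reproduce outside of Open MPI. A minimal sketch (the 32100 cap is only there so the loop also terminates on filesystems without the limit, such as ext4 or tmpfs; note it takes a minute or so to run, since each mkdir is a separate process):

```shell
#!/bin/sh
# Sketch: hit the per-directory subdirectory limit.
# ext3 caps a directory at 32000 hard links; '.' and '..' leave
# room for 31998 subdirectories.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 32100 ] && mkdir "$dir/sub$i" 2>/dev/null; do
    i=$((i + 1))
done
echo "created $i subdirectories before stopping"
rm -rf "$dir"
```

On ext3 the loop stops when mkdir fails with EMLINK; on filesystems without the limit it runs up to the cap.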
The session directories are built to house shared memory backing files, plus other potential entries depending upon options. They should be deleted upon finalize of each process, so you shouldn't be running out of them.
I can check that the code is cleaning them out (or at least attempting to do so). I'm not sure whether there is something about ext3 that might retain the directory entries until the "parent" process terminates, even though the files have been deleted.
If you do an ls on the directory tree, do you see 32k subdirectories? Or do you only see the ones for the active processes?
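A quick way to count them - a sketch with a small helper that works on any directory; the commented invocation follows the path pattern in the error log above and will differ on your system:

```shell
#!/bin/sh
# Count the immediate subdirectories of a session tree. Point it
# at the openmpi-sessions-* tree while the master is running.
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l
}

# e.g. (path pattern taken from the error log above - adjust it):
# count_subdirs "/tmp/.private/$USER/openmpi-sessions-$USER@$(hostname)_0"
```

Running it periodically (or under watch) shows whether the count stays near the number of live processes or grows without bound.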
That's a good point. While the master process is running I can see the directory fill up. When I Ctrl-C the master, the directory disappears completely. But when I let it run all the way to the 32K limit, the directory does not disappear and still contains 32K subdirectories even after the master gets killed by MPI.
Some process must not be closing some file in these directories, which would prevent them from being unlinked - if I understand ext3 correctly.
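One way to check that hypothesis from the shell - a sketch; note that on local Linux filesystems (ext3 included) an unlinked-but-open file's directory entry disappears immediately, and only the inode lives on until the last descriptor is closed, so a held-open file by itself would not keep the subdirectory entries visible (NFS behaves differently, leaving .nfs* silly-rename files behind):

```shell
#!/bin/sh
# Sketch: does an open file keep its directory entry after unlink?
dir=$(mktemp -d)
touch "$dir/backing_file"
exec 3< "$dir/backing_file"   # hold the file open on fd 3
rm "$dir/backing_file"        # unlink while still open
remaining=$(ls -A "$dir" | wc -l)
echo "entries left after unlink: $remaining"
rmdir "$dir" && echo "rmdir succeeded with the file still open"
exec 3<&-                     # now release the descriptor
```

If the subdirectories survive even though their files are gone, the rmdir on the subdirectory itself (rather than the unlink of the files) is the step that must be failing or getting skipped.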
users mailing list