Indeed - that is very helpful! Thanks!

Looks like we aren't cleaning up high enough; we're missing a directory level. I seem to recall seeing that error go by and that someone fixed it on our devel trunk, so this is likely a repair that didn't get moved over to the release branch as it should have.

I'll look into it and report back.

Thanks again
Ralph

On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:
On Wed, Dec 2, 2009 at 14:23, Ralph Castain <rhc@open-mpi.org> wrote:
Hmm... If you are willing to keep trying, could you perhaps let it run for a brief time, Ctrl-Z it, and then do an ls on the directory of a process that has already terminated? The PIDs will be in order, so just look for an early number (not mpirun or the parent, of course).

It would help if you could give us the contents of a directory from a child process that has terminated; that would tell us which subsystem is failing to clean up properly.

OK, so I Ctrl-Z the master. In /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one directory:

/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857

I can't find that PID, though. mpirun has PID 4230, orted does not exist, master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again, slave has a different PID, as expected. I Ctrl-Z'ed in iteration 68, and there are 70 sequentially numbered directories starting at 0. Every directory contains another directory called "0". There is nothing in any of those directories. I see, for instance:

/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
total 4.0K
drwx------ 2 nbock users 4.0K Dec  2 14:41 0

and

nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70/0/
total 0

I hope this information helps. Did I understand your question correctly?

nick

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users