On Wed, Dec 2, 2009 at 14:23, Ralph Castain <email@example.com> wrote:
Hmm... if you are willing to keep trying, could you perhaps let it run for a brief time, Ctrl-Z it, and then do an ls on a directory from a process that has already terminated? The pids will be in order, so just look for an early number (not mpirun or the parent, of course).
It would help if you could give us the contents of a directory from a child process that has terminated. That would tell us which subsystem is failing to clean up properly.
Ok, so I Ctrl-Z the master. In /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0 I now have only one directory
I can't find that PID though. mpirun has PID 4230, orted does not exist, master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again, slave has a different PID, as expected. I Ctrl-Z'ed in iteration 68, and there are 70 sequentially numbered directories starting at 0. Every one of them contains another directory called "0", and there is nothing in any of those "0" directories. For instance, I see:
/tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70
drwx------ 2 nbock users 4.0K Dec 2 14:41 0
nbock@mujo /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 $ ls -lh 70/0/
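In case it is useful, the same check could be scripted over all of the numbered directories at once instead of running ls on each one by hand. This is only a sketch: the function names list_empty_dirs and count_subdirs are made up for illustration, and it assumes a find that supports -empty, -mindepth and -maxdepth (GNU and BSD find do; they are not POSIX).

```shell
#!/bin/sh
# Hypothetical helper: print every empty directory below the given root,
# one path per line, sorted so the output is stable between runs.
# Assumes a find with the -empty test (GNU/BSD extension, not POSIX).
list_empty_dirs() {
    find "$1" -type d -empty | sort
}

# Hypothetical helper: count the immediate subdirectories of the given
# directory. tr strips the padding that some wc implementations add.
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}
```

With mpirun suspended, something like list_empty_dirs /tmp/.private/nbock/openmpi-sessions-nbock@mujo_0/857 would list the leftover empty directories, and count_subdirs on the same path would give the number of numbered directories to compare against the iteration count.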
I hope this information helps. Did I understand your question correctly?