
Open MPI User's Mailing List Archives


Subject: [OMPI users] Leftover session directories [was sm btl choices]
From: Rolf Vandevaart (Rolf.Vandevaart_at_[hidden])
Date: 2010-03-01 16:34:17

On 03/01/10 11:51, Ralph Castain wrote:
> On Mar 1, 2010, at 8:41 AM, David Turner wrote:
>> On 3/1/10 1:51 AM, Ralph Castain wrote:
>>> Which version of OMPI are you using? We know that the 1.2 series was unreliable about removing the session directories, but 1.3 and above appear to be quite good about it. If you are having problems with the 1.3 or 1.4 series, I would definitely like to know about it.
>> Oops; sorry! OMPI 1.4.1, compiled with PGI 10.0 compilers,
>> running on Scientific Linux 5.4, ofed 1.4.2.
>> The session directories are *frequently* left behind. I have
>> not really tried to characterize under what circumstances they
>> are removed. But please confirm: they *should* be removed by
>> OMPI.
> Most definitely - they should always be removed by OMPI. This is the first report we have had of them -not- being removed in the 1.4 series, so it is disturbing.
> What environment are you running under? Does this happen under normal termination, or under abnormal failures (the more you can tell us, the better)?

Hi Ralph:

It turns out that I am seeing session directories left behind as well
with v1.4 (r22713). I have not tested any other versions. I believe
there are two elements that make this reproducible:
1. Run across 2 or more nodes.
2. CTRL-C out of the MPI job.

Then take a look at the remote nodes and you may see a leftover session
directory. The mpirun node seems to be clean.
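For anyone who wants to repeat this without sitting at the keyboard, the two steps above could be scripted roughly as follows. This is only a sketch: `interrupt_after` is a made-up helper name, it assumes GNU coreutils `timeout` is installed, and sending SIGINT this way only approximates a real CTRL-C from the terminal. The host names in the usage comment are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): run a command and send it
# SIGINT after a delay, approximating a CTRL-C. Assumes GNU coreutils
# `timeout` is available on the launching node.
interrupt_after() {
    delay="$1"    # seconds to wait before sending SIGINT
    shift
    # timeout exits with status 124 if the delay expired before the
    # command finished on its own.
    timeout -s INT "$delay" "$@"
}

# Example invocation (placeholder host names, not run here):
# interrupt_after 5 mpirun -np 4 -host nodeA,nodeA,nodeB,nodeB ring_slow_c
```

An exit status of 124 from the helper indicates that the delay expired and the signal was sent; any other status means the job ended by itself first.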

Here is an example using two nodes. I also added some sleeps to the
ring_c program to slow things down so I could hit CTRL-C.

First, tmp directories are empty:
[rolfv_at_burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
[rolfv_at_burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.

Now run test:
[rolfv_at_burl-ct-x2200-6 ~/examples]$ mpirun -np 4 -host burl-ct-x2200-6,burl-ct-x2200-6,burl-ct-x2200-7,burl-ct-x2200-7 ring_slow_c
Process 0 sending 10 to 1, tag 201 (4 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
mpirun: killing job...

mpirun noticed that process rank 0 with PID 3002 on node burl-ct-x2200-6
exited on signal 0 (Unknown signal 0).
4 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished

[burl-ct-x2200-6:02990] 2 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix

Now check tmp directories:
[rolfv_at_burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
[rolfv_at_burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
total 8
drwx------ 3 rolfv hpcgroup 4096 Mar 1 17:27 20007/
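
The manual `ls` checks above could be wrapped in a small function so the same test can be repeated on each node. This is a minimal sketch, not anything shipped with Open MPI: `list_session_dirs` is a made-up name, and the only thing it takes from the transcript is the `/tmp/openmpi-sessions-<user>*` naming seen there.

```shell
#!/bin/sh
# Hypothetical helper: print any leftover Open MPI session directories
# for a user, one path per line. Prints nothing if none are found.
# Arguments (both optional): base directory, user name.
list_session_dirs() {
    base="${1:-/tmp}"
    user="${2:-$(id -un)}"
    for d in "$base"/openmpi-sessions-"$user"*; do
        # An unmatched glob stays literal, so test that the path exists.
        [ -d "$d" ] && printf '%s\n' "$d"
    done
    return 0
}

# Example (assumes passwordless ssh and placeholder host names; not run here):
# for h in nodeA nodeB; do ssh "$h" "ls -d /tmp/openmpi-sessions-$USER*"; done
```

Empty output on every node after the job ends means the cleanup worked; any path printed on a remote node matches the leftover seen above.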