Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Comm_spawn lots of times
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-02 17:00:42


Indeed - that is very helpful! Thanks!

Looks like we aren't cleaning up high enough in the session directory tree - we're missing the directory level. I seem to recall seeing that error go by and that someone fixed it on our devel trunk, so this is likely a repair that didn't get moved over to the release branch as it should have been.
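
For illustration only, here is a rough Python sketch of what "missing the directory level" amounts to on disk. It is not Open MPI's cleanup code; the function names (is_empty_tree, find_leftover_job_dirs, prune) and the temporary directory layout are assumptions made up for this example, modeled on the numbered directories that each contain an empty "0". It finds numbered directories that hold no files and then removes them bottom-up, parent level included.

```python
import os
import tempfile


def is_empty_tree(path):
    """Return True if `path` holds no files, only (possibly nested) directories."""
    for _root, _dirs, files in os.walk(path):
        if files:
            return False
    return True


def find_leftover_job_dirs(session_root):
    """Return numbered subdirectories of `session_root` that contain no files."""
    names = [n for n in os.listdir(session_root) if n.isdigit()]
    leftovers = []
    for name in sorted(names, key=int):
        path = os.path.join(session_root, name)
        if os.path.isdir(path) and is_empty_tree(path):
            leftovers.append(path)
    return leftovers


def prune(path):
    """Remove an empty directory tree bottom-up, including `path` itself."""
    # topdown=False visits children before their parents, so each rmdir
    # only ever sees a directory that is already empty.
    for root, _dirs, _files in os.walk(path, topdown=False):
        os.rmdir(root)


if __name__ == "__main__":
    # Hypothetical layout: <tmp>/857/<job>/0, with nothing inside the "0" dirs.
    with tempfile.TemporaryDirectory() as tmp:
        session_root = os.path.join(tmp, "857")
        for job in range(3):
            os.makedirs(os.path.join(session_root, str(job), "0"))

        leftovers = find_leftover_job_dirs(session_root)
        print("leftover job directories:", len(leftovers))

        for path in leftovers:
            prune(path)
        print("remaining entries:", os.listdir(session_root))
```

Removing only the innermost "0" directory would leave the numbered parent in place, which matches the kind of residue described further down in this thread; pruning up to the numbered level is the step the sketch adds.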

I'll look into it and report back.

Thanks again
Ralph

On Dec 2, 2009, at 2:45 PM, Nicolas Bock wrote:

>
>
> On Wed, Dec 2, 2009 at 14:23, Ralph Castain <rhc_at_[hidden]> wrote:
> > Hmm... if you are willing to keep trying, could you perhaps let it run for a brief time, ctrl-z it, and then do an ls on a directory from a process that has already terminated? The pids will be in order, so just look for an early number (not mpirun or the parent, of course).
> >
> > It would help if you could give us the contents of a directory from a child process that has terminated - that would tell us which subsystem is failing to clean up properly.
>
> OK, so I Ctrl-Z'ed the master. In /tmp/.private/nbock/openmpi-sessions-nbock_at_mujo_0 I now have only one directory:
>
> /tmp/.private/nbock/openmpi-sessions-nbock_at_mujo_0/857
>
> I can't find that PID though. mpirun has PID 4230, orted does not exist, master is 4231, and slave is 4275. When I "fg" master and Ctrl-Z it again, slave has a different PID, as expected. I Ctrl-Z'ed in iteration 68; there are 70 sequentially numbered directories starting at 0. Every directory contains another directory called "0", and there is nothing in any of those "0" directories. I see for instance:
>
> /tmp/.private/nbock/openmpi-sessions-nbock_at_mujo_0/857 $ ls -lh 70
> total 4.0K
> drwx------ 2 nbock users 4.0K Dec 2 14:41 0
>
> and
>
> nbock_at_mujo /tmp/.private/nbock/openmpi-sessions-nbock_at_mujo_0/857 $ ls -lh 70/0/
> total 0
>
> I hope this information helps. Did I understand your question correctly?
>
> nick