On 02/03/2014 03:09 PM, Ralph Castain wrote:
> OMPI will error out in that case, as you originally reported. What seems to be happening is that you have a bunch of stale session directories, but I'm puzzled because the creation dates are so current - for whatever reason, OMPI seems to be getting the same jobid much more often than it should. Can you tell me something about the environment - e.g., is it managed or just using hostfile?
This computer is used about 11 times a day to launch about 1500
executions of our in-house (finite element) code.
We launch at most 12 single-process executions at the same time, but
we use PETSc, which always initializes the MPI environment...
Also, we are launching some tests which use between 2 and 128 processes
(on the same computer) just to ensure proper code testing. In fact,
performance is not really an issue in these 128-process tests and we set
the following environment variable:
because we encountered timeout problems before...
The whole test run lasts about 1 hour and the result is used to give
feedback to users who "pushed" modifications to the code....
So I would add: sometimes the tests may be interrupted by segfaults,
"kill -TERM" or anything you can imagine... The problem now is that it
won't even start if a mere file exists...
I can flush those files right now, but I am almost sure they will
reappear in the following days, leading to false "bad results" for the
tests... and I will have to set up a cleanup procedure before launching
all the tests... But that will not prevent those files from being
created while running the first of the 1500 tests and causing one or
more of the rest to fail....
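
A minimal sketch of such a pre-run cleanup, for illustration only. The
function name, the "openmpi-sessions-*" pattern and the age threshold
are assumptions, not something confirmed in this thread; the pattern
should be checked against what actually appears in /tmp. It should only
be run while no MPI job is active, and it does not address files that
get created during the test run itself.

```python
import fnmatch
import os
import shutil
import time


def remove_stale_entries(tmp_dir, pattern, max_age_seconds, now=None):
    """Delete entries of tmp_dir whose name matches pattern and whose
    modification time is at least max_age_seconds in the past.

    Handles both plain files and directories, since a stale plain file
    can block the creation of a directory of the same name.
    Returns the sorted list of names that were actually removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(tmp_dir):
        if not fnmatch.fnmatch(name, pattern):
            continue
        path = os.path.join(tmp_dir, name)
        try:
            age = now - os.lstat(path).st_mtime
        except FileNotFoundError:
            continue  # entry vanished between listdir() and lstat()
        if age < max_age_seconds:
            continue  # recent entry, possibly still in use
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path, ignore_errors=True)
        else:
            try:
                os.remove(path)
            except OSError:
                pass  # e.g. not owned by us; leave it in place
        if not os.path.lexists(path):
            removed.append(name)
    return sorted(removed)


if __name__ == "__main__":
    # Hypothetical values: pattern and threshold are assumptions.
    for entry in remove_stale_entries("/tmp", "openmpi-sessions-*", 2 * 3600):
        print("removed", entry)
```

The age threshold is there so that a job still running is less likely to
lose its directory; with a threshold of 0 everything matching the
pattern is removed.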
I hope this is the information you wanted... Is it?
> On Feb 3, 2014, at 12:00 PM, Eric Chamberland <Eric.Chamberland_at_[hidden]> wrote:
>> On 02/03/2014 02:49 PM, Ralph Castain wrote:
>>> Seems rather odd - is your /tmp by any chance network mounted?
>> No it is a "normal" /tmp:
>> "cd /tmp; df -h ." gives:
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/sda1 49G 17G 30G 37% /
>> And there is plenty of disk space...
>> I agree it is odd, but how should OpenMPI react when trying to create a directory over an existing file name? I mean what is it programmed to do?