Jeff Squyres <jsquyres_at_[hidden]> writes:
> Could the nodes be running out of shared memory and/or temp filesystem

I'm also seeing this non-reproducibly (on openSUSE 10.3, with Sun's
ClusterTools 8.1 prerelease on dual Barcelona nodes during PMB runs
under SGE). I haven't had time to build the final 1.3 release.
Certainly /tmp space shouldn't have been a problem on these systems.
What exactly does `shared memory' mean above? The jobs don't appear to
be using shared memory segments, at least. In that case, what's there
to run out of, other than filespace, and shouldn't there be an error
reported when creating the file then? It smacks more of a race condition,
since it's errno 2 (ENOENT), not 28 (ENOSPC) or something.
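
For reference, here is a rough sketch of the distinction between those two
errno values. It only illustrates how ENOENT can arise when a directory
vanishes before a file is created in it; it is not a diagnosis of the actual
failure, and the helper names (describe_errno, errno_from_create) and the
scratch directory are made up for the example.

```python
import errno
import os
import tempfile


def describe_errno(code):
    """Return (symbolic name, system message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def errno_from_create(path):
    """Try to create a file at path; return 0 on success, else the errno."""
    try:
        with open(path, "w"):
            return 0
    except OSError as exc:
        return exc.errno


# The two values contrasted in the message above.
for code in (2, 28):
    name, message = describe_errno(code)
    print(f"errno {code}: {name} ({message})")

# Hypothetical scenario: the directory is removed before the file is
# created in it, as could happen if one process cleans up too early.
scratch = tempfile.mkdtemp()
os.rmdir(scratch)
failed = errno_from_create(os.path.join(scratch, "backing_file"))
print("create in removed directory ->", errno.errorcode.get(failed, failed))
```

A full filesystem would be expected to produce ENOSPC from the write or create
step instead, which is why errno 2 points away from simply running out of space.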

I'm writing in a good deal of ignorance -- as is doubtless obvious --
and don't have time to grovel through the code, but I might be able to get extra
diagnostics if anyone can suggest something that doesn't involve much