I managed to find time to reproduce the issue, although the results are
not very consistent from run to run, and I suspect it may not be easy to
reproduce with a simple code. I have also never actually written an
MPI code from scratch. (I am cc'ing Michael Sternberg, who compiled
Open MPI, in case there are flags to add to the compilation.)
I have 8 processes on a single dual quad-core machine reading from the
same file using formatted Fortran I/O. I deliberately created an error in
the read. If this error is a format error, all the processes
terminate. If the error is because there is not enough data (EOF), I
get somewhere from 1 to 7 zombies. They don't seem to be doing
anything (top -ulmarks shows no CPU activity), but I have no idea if
they have locks on the file or anything else (I think they might, but
have no idea how to tell).
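One possible way to tell, as a rough sketch only: on Linux, the State line in /proc/<pid>/status distinguishes a true zombie (Z, already exited but not yet reaped) from a process that is merely sleeping or blocked, and /proc/<pid>/fd lists what a live process still has open. The helper names below (parse_state, describe, open_files) are made up for illustration, and the script assumes a Linux /proc filesystem and that it is run as the same user as the leftover processes.

```python
import os
import sys

def parse_state(status_text):
    """Return the one-letter state code from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tZ (zombie)"; the code is the second field.
            return line.split()[1]
    return None

def describe(state):
    """Translate a state code into a short description."""
    names = {
        "R": "running",
        "S": "sleeping",
        "D": "uninterruptible wait (usually I/O)",
        "Z": "zombie (exited, not yet reaped)",
        "T": "stopped",
    }
    return names.get(state, "unknown")

def open_files(pid):
    """List the targets of the process's open file descriptors.

    Returns an empty list if /proc/<pid>/fd cannot be read, for example
    because the process is gone or belongs to another user.
    """
    fd_dir = "/proc/{}/fd".format(pid)
    paths = []
    try:
        for fd in os.listdir(fd_dir):
            try:
                paths.append(os.readlink(os.path.join(fd_dir, fd)))
            except OSError:
                pass  # descriptor closed between listdir and readlink
    except OSError:
        pass
    return paths

if __name__ == "__main__":
    # Usage (hypothetical script name): python procstate.py PID [PID ...]
    for pid in sys.argv[1:]:
        with open("/proc/{}/status".format(pid)) as f:
            state = parse_state(f.read())
        print(pid, state, describe(state), open_files(pid))
```

If a leftover process reports Z, it has already exited and its file descriptors have been released, so it cannot be holding the file open. If it reports S or D and the data file still appears in its open files, it is hung rather than dead. On Linux, any file locks currently held are listed separately in /proc/locks.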
On Fri, Jan 29, 2010 at 6:18 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Jan 29, 2010, at 9:13 AM, Laurence Marks wrote:
>> OK, but trivial codes don't always reproduce problems.
> Yes, but if the problem is a file reading beyond the end, that should be fairly isolated behavior.
>> Is strace useful?
> Sure. But let's check to see if the apps are actually dying or hanging first.
> Jeff Squyres
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Chair, Commission on Electron Crystallography of IUCR
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.