I have already sent the following report to the SGE mailing list, but I
thought I would also try here.
I have been trying out version 1.2b2 for its integration with SGE. The
simple "hello world" test program works fine by itself, but there are
issues when submitting it to SGE.
For small numbers of tasks, for SOME runs, I get errors for each of the
non-master tasks, and they are all one of the following:
error: commlib error: got read error (closing
error: commlib error: can't read general message size header (GMSH)
When I repeat runs, these errors tend to go away, as if a node chokes on
the first job it runs but is then fine for subsequent jobs.
I do get the correct output.
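To put numbers on how often each message shows up in a given run, something like the following could be used. It is only a rough sketch: the function name is made up, and it assumes the job's stderr ends up in SGE's default <jobname>.e<jobid> file in the submit directory (with a parallel environment, some messages may land in the .pe<jobid> file instead).

```shell
# Hypothetical helper: count commlib errors by message text in one file.
# Prints one line per distinct message, most frequent first.
tally_commlib_errors() {
    grep 'commlib error' "$1" | sort | uniq -c | sort -rn
}

# Example call (the file name is an assumption, not taken from a real run):
# tally_commlib_errors hello.sh.e<jobid>
```

Comparing the counts between a first run and a repeat run on the same nodes would show whether the errors really disappear after the first job.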
Things change when I try a large job, say 400 tasks. I get loads of GMSH
errors, but NO output, and SGE's qstat command aborts:
[heywood@blade1 ompi]$ qsub -pe mpi 400 hello.sh
Your job 8239 ("hello.sh") has been submitted
[heywood@blade1 ompi]$ qstat -t
critical error: unrecoverable error - contact systems manager
I then have to qdel the job from another window.
If anyone has seen anything like this, I'd be interested in hearing about it.
Since the errors are coming from SGE's communication library, I did
increase the file descriptor limit (ulimit -n 65536), but it made no