I have already sent the following report to the SGE mailing list, but I thought I would also try here.
I have been trying out version 1.2b2 for its integration with SGE. The simple “hello world” test program works fine by itself, but there are issues when submitting it to SGE.
For small numbers of tasks, for SOME runs, I get errors for each of the non-master tasks, and they are all one of the following:
error: commlib error: got read error (closing "blade27.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade221
When I repeat runs, these errors tend to go away, as if a node chokes on the first job it runs but is then OK for subsequent jobs. I do get the correct output.
Things change when I try a large job, say 400 tasks. I get loads of GMSH errors, but NO output, and SGE’s qstat command aborts:
[heywood@blade1 ompi]$ qsub -pe mpi 400 hello.sh
Your job 8239 ("hello.sh") has been submitted
[heywood@blade1 ompi]$ qstat -t
critical error: unrecoverable error - contact systems manager
I then have to qdel the job from another window.
If anyone has seen anything like this, I’d be interested in hearing about it. Since the errors are coming from SGE’s communication library, I did increase the file descriptor limit (ulimit -n 65536), but it made no difference.
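
For anyone who wants to compare, here is a minimal sketch of how the descriptor limits can be checked. It only reports the soft and hard limits of the shell it runs in; the variable names are just for illustration. Since ulimit affects only the current shell and its children, a limit raised in a login shell may not be the one a job actually sees, so running the same lines from inside a job script would be one way to check (that the values differ is an assumption to verify, not something I have confirmed).

```shell
#!/bin/sh
# Report the open-file (file descriptor) limits of the current shell.
# Run interactively and again from inside a job script to compare.
soft=$(ulimit -S -n)   # soft limit: the value currently enforced
hard=$(ulimit -H -n)   # hard limit: the ceiling the soft limit can be raised to
echo "soft nofile limit: $soft"
echo "hard nofile limit: $hard"
```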