I have sent the following report to the SGE
mailing list, but I thought I would also try here.
I have been trying out version 1.2b2 for its
integration with SGE. The simple "hello world" test program works
fine by itself, but there are issues when submitting it to SGE.
For small numbers of tasks, for SOME runs, I get
errors for each of the non-master tasks, and they are all one of the following:
error: commlib error: got read error (closing "blade27.bluehelix.cshl.edu/execd/1")
error: commlib error: can't read general message size header (GMSH) (closing "blade221.bluehelix.cshl.edu/execd/1")
When I repeat runs, these errors tend to go away,
as if a node coughs on the first job it runs but is then OK for
subsequent jobs. Even with the errors, I do get the correct output.
Things change when I try a large job, say 400 tasks.
I get loads of GMSH errors, but NO output, and SGE’s qstat command aborts:
[heywood@blade1 ompi]$ qsub -pe mpi 400 hello.sh
Your job 8239 ("hello.sh") has been submitted
[heywood@blade1 ompi]$ qstat -t
critical error: unrecoverable error - contact systems manager
Aborted
[heywood@blade1 ompi]$
I then have to qdel the job from another window.
If anyone has seen anything like this, I’d be
interested in hearing. Since the errors are coming from SGE’s
communication library, I did increase the file descriptor limit (ulimit -n
65536), but it made no difference.
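
One possible check, offered as a sketch rather than a confirmed diagnosis: a limit raised with ulimit in a login shell applies only to that shell and its children, so it may not reach the processes that SGE starts on the compute nodes. The snippet below assumes bash is available on the nodes and prints the soft and hard open-file limits of whatever shell runs it. The function name report_fd_limits is hypothetical, not part of SGE or of the job script above.

```shell
#!/bin/bash
# Hypothetical helper: print the open-file limits seen by the current shell.
# uname -n gives the node name, so output from many nodes can be told apart.
report_fd_limits() {
    printf 'host=%s soft=%s hard=%s\n' "$(uname -n)" "$(ulimit -Sn)" "$(ulimit -Hn)"
}

report_fd_limits
```

Placing these lines near the top of hello.sh would show the limit the job itself sees, which can then be compared with the 65536 set interactively.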
Thanks,
Todd Heywood