Hi Todd,
I don't know the answer myself, but I see that Andreas from the open
source grid engine alias (user_at_[hidden]) is already looking into
this. He should be able to help, since the problem seems to be more
related to the internals of qmaster.
http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=18773
So if anyone else wants to know more about what seems to be a file
descriptor limit issue in the internals of SGE/N1GE, feel free to
follow the comments over there...
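
As a side note on the file descriptor limit: a limit raised with ulimit
in a login shell applies only to that shell and the processes it starts,
so it can help to check what a given process actually sees. Below is a
minimal sketch using Python's standard resource module. The helper names
nofile_limits and describe are made up for illustration, and this is not
a statement about how qmaster or execd set their own limits.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to a readable string."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is the one enforced; the hard limit is its ceiling.
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```

Running this from inside a job script (rather than from the submitting
shell) would show the limits in the environment the job actually gets.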
Heywood, Todd wrote:
> I have sent the following experiences to the SGE mailing list, but I
> thought I would also try here.
>
>
>
> I have been trying out version 1.2b2 for its integration with SGE. The
> simple hello world test program works fine by itself, but there are
> issues when submitting it to SGE.
>
>
>
> For small numbers of tasks, for SOME runs, I get errors for each of the
> non-master tasks, and they are all one of the following:
>
>
>
> error: commlib error: got read error (closing
> "blade27.bluehelix.cshl.edu/execd/1")
>
>
>
> error: commlib error: can't read general message size header (GMSH)
> (closing "blade221.bluehelix.cshl.edu/execd/1")
>
>
>
> When I repeat runs, these errors tend to go away, like the first time a
> node runs a job it coughs on it, but then it is OK for subsequent jobs.
> I do get the correct output.
>
>
>
> Things change when I try a large job, say 400 tasks. I get loads of GMSH
> errors, but NO output, and SGE's qstat command aborts:
>
>
>
> [heywood_at_blade1 ompi]$ qsub -pe mpi 400 hello.sh
>
> Your job 8239 ("hello.sh") has been submitted
>
> [heywood_at_blade1 ompi]$ qstat -t
>
> critical error: unrecoverable error - contact systems manager
>
> Aborted
>
> [heywood_at_blade1 ompi]$
>
>
>
> I then have to qdel the job from another window.
>
>
>
> If anyone has seen anything like this, I'd be interested in hearing.
> Since the errors are coming from SGE's communication library, I did
> increase the file descriptor limit (ulimit -n 65536), but it made no
> difference.
>
>
>
> Thanks,
>
>
>
> Todd Heywood
>
>
>
>
--
Thanks,
- Pak Lui
pak.lui_at_[hidden]