Open MPI User's Mailing List Archives

From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2007-01-29 20:10:41


Hi Todd,

I personally don't know the answer, but I see that Andreas from the open
source grid engine alias (user_at_[hidden]) is already looking into
your issues. He should be able to help, since the problem is more
related to the internals of qmaster.

http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=18773

So if anyone else wants to know more about this, which seems to be
related to a file descriptor limit issue in the internals of SGE/N1GE,
feel free to follow the comments over there...
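
For anyone checking the descriptor limit mentioned in the quoted message
below: a limit raised with ulimit -n in a login shell applies only to that
shell and the processes it starts, so a daemon started elsewhere keeps
whatever limit it inherited. Whether that matters for the qmaster problem is
not established here. As a minimal sketch in Python (not something from the
original thread; the helper names nofile_limits and describe are
hypothetical), this prints the open-file limits the current process actually
sees:

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```

Running it from the same environment that launches a job shows the limits
that job would inherit, which can differ from what an interactive shell
reports.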

Heywood, Todd wrote:
> I have sent the following experiences to the SGE mailing list, but I
> thought I would also try here…
>
>
>
> I have been trying out version 1.2b2 for its integration with SGE. The
> simple “hello world” test program works fine by itself, but there are
> issues when submitting it to SGE.
>
>
>
> For small numbers of tasks, for SOME runs, I get errors for each of the
> non-master tasks, and they are all one of the following:
>
>
>
> error: commlib error: got read error (closing
> "blade27.bluehelix.cshl.edu/execd/1")
>
>
>
> error: commlib error: can't read general message size header (GMSH)
> (closing "blade221.bluehelix.cshl.edu/execd/1")
>
>
>
> When I repeat runs, these errors tend to go away, like the first time a
> node runs a job it coughs on it, but then it is OK for subsequent jobs.
> I do get the correct output.
>
>
>
> Things change when I try a large job, say 400 tasks. I get loads of GMSH
> errors, but NO output, and SGE’s qstat command aborts:
>
>
>
> [heywood@blade1 ompi]$ qsub -pe mpi 400 hello.sh
>
> Your job 8239 ("hello.sh") has been submitted
>
> [heywood@blade1 ompi]$ qstat -t
>
> critical error: unrecoverable error - contact systems manager
>
> Aborted
>
> [heywood@blade1 ompi]$
>
>
>
> I then have to qdel the job from another window.
>
>
>
> If anyone has seen anything like this, I’d be interested in hearing.
> Since the errors are coming from SGE’s communication library, I did
> increase the file descriptor limit (ulimit -n 65536), but it made no
> difference.
>
>
>
> Thanks,
>
>
>
> Todd Heywood
>
>
>
>

-- 
Thanks,
- Pak Lui
pak.lui_at_[hidden]