Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
From: Rayson Ho (raysonlogin_at_[hidden])
Date: 2012-03-15 00:41:09


Hi Joshua,

I don't think the new built-in rsh in later versions of Grid Engine is
going to make any difference - the orted is the real starter of the
MPI tasks and should have a greater influence on the task environment.

However, it would help if you could record the nice value and resource
limits of each MPI task. You can easily do so with a shell wrapper
like this one:

========================================
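#!/bin/sh

# Record the resource limits of this task in a per-process file
# on the local node ($$ is the PID of this wrapper shell).
ulimit -a > /tmp/mpijob.$$

# Print the nice value of this wrapper shell; the real executable
# inherits it. Selecting by PID avoids the false matches of grep.
ps -o pid,user,nice,command -p $$

# Run the real executable, passing along any arguments.
<PATH to real executable> "$@"

exit $?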
========================================

Use mpirun to launch it as if it were the real MPI application. The
limits of each task end up in /tmp/mpijob.<PID> on the node where it
ran, so you can see whether Grid Engine introduces limits that are
causing issues.
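
One way to spot a limit that only appears under Grid Engine is to compare a
recorded file against a baseline taken in an interactive shell on the same
node. The following is only a rough sketch: compare_limits is a hypothetical
helper name, and the file paths in the usage comment are assumptions, not
something Open MPI or Grid Engine provides.

```shell
#!/bin/sh
# Sketch only: compare_limits is a hypothetical helper.
# Usage: compare_limits BASELINE RECORDED
# Both arguments are files containing `ulimit -a` output.
# Prints the differing lines (diff format) and returns non-zero
# when the files differ or diff fails; returns 0 when identical.
compare_limits() {
    baseline=$1
    recorded=$2
    if diff "$baseline" "$recorded"; then
        echo "no differences in resource limits"
        return 0
    fi
    return 1
}

# Example (paths are assumptions):
#   ulimit -a > /tmp/limits.baseline      # in an interactive shell on the node
#   compare_limits /tmp/limits.baseline /tmp/mpijob.12345
```

Any line that shows up in the diff (for example a smaller stack size or
locked-memory limit) is a candidate for the cause of the segfault.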

Rayson

=================================
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/

On Thu, Mar 15, 2012 at 12:28 AM, Joshua Baker-LePain <jlb17_at_[hidden]> wrote:
> On Thu, 15 Mar 2012 at 12:44am, Reuti wrote
>
>
>> Which version of SGE are you using? The traditional rsh startup was
>> replaced by the builtin startup some time ago (although it should still
>> work).
>
>
> We're currently running the rather ancient 6.1u4 (due to the "If it ain't
> broke..." philosophy).  The hardware for our new queue master recently
> arrived and I'll soon be upgrading to the most recent Open Grid Scheduler
> release.  Are you saying that the upgrade with the new builtin startup
> method should avoid this problem?
>
>
>> Maybe this already shows the problem: there are two `qrsh -inherit` calls,
>> as Open MPI thinks these are different machines (I ran with only one slot
>> on each host, hence didn't see it at first, but I can reproduce it now).
>> For SGE, however, both may end up in the same queue, overwriting the
>> openmpi-session in $TMPDIR.
>>
>> Although it's running: do you get all the output? If I request 4 slots and
>> get one from each queue on both machines, mpihello outputs only 3 lines:
>> the "Hello World from Node 3" is always missing.
>
>
> I do seem to get all the output -- there are indeed 64 Hello World lines.
>
> Thanks again for all the help on this.  This is one of the most productive
> exchanges I've had on a mailing list in far too long.
>
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF