Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-11-17 10:13:29


On 11/17/2010 10:00 AM, Ralph Castain wrote:
> --leave-session-attached is always required if you want to see output
> from the daemons. Otherwise, the launcher closes the ssh session (or
> qrsh session, in this case) as part of its normal operating procedure,
> thus terminating the stdout/err channel.
>
>
I believe you, but isn't it weird that without the -binding option to
qsub we saw --report-bindings output from the orteds?
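
To make the request concrete, here is a sketch of the rerun that would
settle this. It reuses the qsub and mpirun lines from Chris' quoted
message further down and adds only --leave-session-attached. The script
name (myScript.com) and binary (./unterm) come from that message; the
variable names are made up for illustration, and the sketch only prints
the commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them, so it is
# safe outside a cluster. Remove the "echo" prefixes to run for real.

# Job submission with OGE core binding requested, as in the quoted run.
QSUB_CMD="qsub -pe mpi 8 -binding linear:2 myScript.com"

# mpirun line for inside the job script: the quoted flags plus
# --leave-session-attached, so the qrsh sessions (and with them the
# orteds' stdout/stderr) stay open.
MPIRUN_CMD="mpirun -mca ras_gridengine_verbose 100 --report-bindings --leave-session-attached -by-core -bind-to-core ./unterm"

echo "$QSUB_CMD"
echo "$MPIRUN_CMD"
```

If the orteds really are bound to a single core, their own binding
reports should then show up alongside the HNP's.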

Do you have the date of the email that has the info you talked about
below? I really am not trying to be an a-hole about this, but there has
been so much data and email flying around that it would be nice to
actually see the output you mention.
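
As an aside on the one binding line that does appear in the quoted
output ("external process binding to cores 0028"): assuming that value
is a hexadecimal bitmask in which bit N stands for core N (an
assumption, not something confirmed in this thread), it can be decoded
with a small helper. decode_mask is a hypothetical name, not part of
Open MPI or OGE.

```shell
#!/bin/sh
# decode_mask: hypothetical helper, not part of Open MPI or OGE.
# Assumes the argument is a hex bitmask where bit N means core N.
decode_mask() {
    mask=$((0x$1))      # parse the hex string, e.g. 0028 -> 40
    core=0
    cores=""
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            cores="$cores $core"
        fi
        mask=$((mask >> 1))
        core=$((core + 1))
    done
    echo $cores         # unquoted on purpose to trim the leading space
}

decode_mask 0028        # prints: 3 5
```

Under that assumption the HNP was handed two cores, which would be
consistent with the "-binding linear:2" request.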

--td

> On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje <terry.dontje_at_[hidden]
> <mailto:terry.dontje_at_[hidden]>> wrote:
>
> On 11/17/2010 09:32 AM, Ralph Castain wrote:
>> Chris' output is coming solely from the HNP, which is correct
>> given the way things were executed. My comment was from another
>> email where he did what I asked, which was to include the flags:
>>
>> --report-bindings --leave-session-attached
>>
>> so we could see the output from each orted. In that email, it was
>> clear that while mpirun was bound to multiple cores, the orteds
>> are being bound to a -single- core.
>>
>> Hence the problem.
>>
> Hmm, I see Ralph's comment on 11/15 but I don't see any output
> that shows what Ralph says above. The only --report-bindings
> output I see is when he runs without OGE binding. Can someone
> give me the date and time of Chris' email with the
> --report-bindings and --leave-session-attached flags? Or a rerun of
> the below with the --leave-session-attached option would also help.
>
> I find it confusing that --leave-session-attached is not required
> when the OGE binding argument is not given.
>
> --td
>
>> HTH
>> Ralph
>>
>>
>> On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje
>> <terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>> wrote:
>>
>> On 11/17/2010 07:41 AM, Chris Jewell wrote:
>>> On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>>>>> You are absolutely correct, Terry, and the 1.4 release series does include the proper code. The point here, though, is that SGE binds the orted to a single core, even though other cores are also allocated. So the orted detects an external binding of one core, and binds all its children to that same core.
>>>> I do not think you are right here. Chris sent the following, which looks like OGE (fka SGE) actually did bind the hnp to multiple cores. However, I believe that message is not coming from the processes themselves and is actually only shown by the hnp. I wonder if, when Chris adds a "-bind-to-core" option, we'll see more output from the a.out's before they exec unterm?
>>> As requested using
>>>
>>> $ qsub -pe mpi 8 -binding linear:2 myScript.com
>>>
>>> and
>>>
>>> 'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core -bind-to-core ./unterm'
>>>
>>> [exec5:06671] System has detected external process binding to cores 0028
>>> [exec5:06671] ras:gridengine: JOB_ID: 59434
>>> [exec5:06671] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
>>> [exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
>>> [exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
>>> [exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
>>> [exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
>>> [exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
>>> [exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
>>>
>>> No more info. I note that the external binding is slightly different to what I had before, but our cluster is busier today :-)
>>>
>> I would have expected more output.
>>
>> --td
>>
>>> Chris
>>>
>>>
>>> --
>>> Dr Chris Jewell
>>> Department of Statistics
>>> University of Warwick
>>> Coventry
>>> CV4 7AL
>>> UK
>>> Tel: +44 (0)24 7615 0778
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Oracle
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle *- Performance Technologies*
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>


