Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-11-17 10:48:05


No problem at all. I confess that I am lost in all the sometimes disjointed
emails in this thread. Frankly, now that I search, I can't find it either!
:-(

I see one email that clearly shows the external binding report from mpirun,
but not from any daemons. I see another email (after you asked whether that
was all the output) that states "yep", indicating that was all the output, and
then proceeds to offer additional output that wasn't in the original email
you asked about!

So I am now as thoroughly confused as you are...

That said, I am confident in the code in ORTE, as it has worked correctly
when I tested it against external bindings in other environments. So I
really do believe this is an OGE issue where the orted isn't getting
correctly bound to all allocated cores.

Perhaps if someone could run this test again with --report-bindings
--leave-session-attached and provide -all- output we could verify that
analysis and clear up the confusion?
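
For readers following the analysis above: on Linux, a process started by a parent that is restricted to certain cores begins life with the same restriction, so a daemon confined to one core confines everything it launches. The sketch below illustrates only that operating-system behaviour with Python's os.sched_setaffinity. It is not ORTE code, the helper name child_affinity is made up for illustration, and it assumes a Linux host.

```python
import ast
import os
import subprocess
import sys


def child_affinity(cores):
    """Restrict this process to `cores`, start a child, and return the
    sorted list of cores the child reports it is allowed to run on.

    Hypothetical helper for illustration only (Linux-specific).
    """
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cores)
    try:
        # The child does nothing except print its own affinity set.
        result = subprocess.run(
            [sys.executable, "-c",
             "import os; print(sorted(os.sched_getaffinity(0)))"],
            capture_output=True, text=True, check=True,
        )
    finally:
        # Always restore the parent's original affinity.
        os.sched_setaffinity(0, original)
    return ast.literal_eval(result.stdout.strip())


if __name__ == "__main__":
    # Pick one core that this process is currently allowed to use.
    one_core = {min(os.sched_getaffinity(0))}
    print("parent restricted to:", sorted(one_core))
    print("child sees:          ", child_affinity(one_core))
```

If the parent is restricted to a single core, the child reports that same single core, which is the behaviour suspected of the orted under OGE.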

On Wed, Nov 17, 2010 at 8:13 AM, Terry Dontje <terry.dontje_at_[hidden]> wrote:

> On 11/17/2010 10:00 AM, Ralph Castain wrote:
>
> --leave-session-attached is always required if you want to see output from
> the daemons. Otherwise, the launcher closes the ssh session (or qrsh
> session, in this case) as part of its normal operating procedure, thus
> terminating the stdout/err channel.
>
>
> I believe you, but isn't it weird that without the -binding option to
> qsub we saw --report-bindings output from the orteds?
>
> Do you have the date of the email that has the info you talked about
> below? I really am not trying to be an a-hole about this, but there has
> been so much data and email flying around that it would be nice to actually
> see the output you mention.
>
> --td
>
>
> On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje <terry.dontje_at_[hidden]> wrote:
>
>> On 11/17/2010 09:32 AM, Ralph Castain wrote:
>>
>> Chris' output is coming solely from the HNP, which is correct given the way
>> things were executed. My comment was from another email where he did what I
>> asked, which was to include the flags:
>>
>> --report-bindings --leave-session-attached
>>
>> so we could see the output from each orted. In that email, it was clear
>> that while mpirun was bound to multiple cores, the orteds are being bound to
>> a -single- core.
>>
>> Hence the problem.
>>
>> Hmm, I see Ralph's comment on 11/15 but I don't see any output that
>> shows what Ralph says above. The only report-bindings output I see is when
>> he runs without OGE binding. Can someone give me the date and time of
>> Chris' email with the --report-bindings and --leave-session-attached? Or a
>> rerun of the below with the --leave-session-attached option would also help.
>>
>> I find it confusing that --leave-session-attached is not required when the
>> OGE binding argument is not given.
>>
>> --td
>>
>> HTH
>> Ralph
>>
>>
>> On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje <terry.dontje_at_[hidden]> wrote:
>>
>>> On 11/17/2010 07:41 AM, Chris Jewell wrote:
>>>
>>> On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>>>
>>> You are absolutely correct, Terry, and the 1.4 release series does include the proper code. The point here, though, is that SGE binds the orted to a single core, even though other cores are also allocated. So the orted detects an external binding of one core, and binds all its children to that same core.
>>>
>>> I do not think you are right here. Chris sent the following, which looks like OGE (fka SGE) actually did bind the HNP to multiple cores. However, I believe that message is not coming from the processes themselves and is only shown by the HNP. I wonder whether, if Chris adds a "-bind-to-core" option, we'll see more output from the a.out's before they exec unterm?
>>>
>>> As requested using
>>>
>>> $ qsub -pe mpi 8 -binding linear:2 myScript.com
>>>
>>> and
>>>
>>> 'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core -bind-to-core ./unterm'
>>>
>>> [exec5:06671] System has detected external process binding to cores 0028
>>> [exec5:06671] ras:gridengine: JOB_ID: 59434
>>> [exec5:06671] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
>>> [exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
>>> [exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
>>> [exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
>>> [exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
>>> [exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
>>> [exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
>>>
>>> No more info. I note that the external binding is slightly different to what I had before, but our cluster is busier today :-)
>>>
>>>
>>> I would have expected more output.
>>>
>>> --td
>>>
>>> Chris
>>>
>>>
>>> --
>>> Dr Chris Jewell
>>> Department of Statistics
>>> University of Warwick
>>> Coventry
>>> CV4 7AL
>>> UK
>>> Tel: +44 (0)24 7615 0778
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>>
>>> --
>>> [image: Oracle]
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle * - Performance Technologies*
>>> 95 Network Drive, Burlington, MA 01803
>>> Email terry.dontje_at_[hidden]
>>>
>>>
>>
>>
>
>
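
As an aid to reading the quoted mpirun output, here is a minimal sketch that pulls the two useful numbers out of it: the cores named in the "external process binding" line and the total slot count from the PE_HOSTFILE lines. It assumes, without confirmation from the thread, that "0028" is a hexadecimal bitmask in which bit i stands for core i. The function names are hypothetical and exist only for this example.

```python
import re

# Relevant lines copied from the output quoted in the thread.
LOG = """\
[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
"""


def cores_from_mask(mask_hex):
    """Return the set bit positions of a hex mask string.

    ASSUMPTION: the reported value is a hex bitmask, bit i == core i.
    """
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def summarize(log):
    """Return (bound cores, {host: slots}) parsed from the log text."""
    found = re.search(r"external process binding to cores ([0-9A-Fa-f]+)", log)
    cores = cores_from_mask(found.group(1)) if found else []
    slots = {
        host: int(count)
        for host, count in re.findall(
            r"ras:gridengine: (\S+): PE_HOSTFILE shows slots=(\d+)", log
        )
    }
    return cores, slots


if __name__ == "__main__":
    cores, slots = summarize(LOG)
    print("bound cores:", cores)                # [3, 5] under the bitmask assumption
    print("total slots:", sum(slots.values()))  # 8
```

Under that assumption the HNP is bound to two cores, which fits "-binding linear:2", and the slots sum to the 8 requested with "-pe mpi 8". The output contains no binding report from any orted, which is the gap the thread is trying to close.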



