Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-11-17 11:09:44


On 11/17/2010 10:48 AM, Ralph Castain wrote:
> No problem at all. I confess that I am lost in all the sometimes
> disjointed emails in this thread. Frankly, now that I search, I can't
> find it either! :-(
>
> I see one email that clearly shows the external binding report from
> mpirun, but not from any daemons. I see another email (after you asked
> if that was all the output) that states "yep", indicating that was
> all the output, and then proceeds to offer additional output that
> wasn't in the original email you asked about!
>
> So I am now as thoroughly confused as you are...
>
> That said, I am confident in the code in ORTE as it has worked
> correctly when I tested it against external bindings in other
> environments. So I really do believe this is an OGE issue where the
> orted isn't getting correctly bound against all allocated cores.
>
I am confused by your statement above because we don't even know what is
being bound or not. We know that it looks like the HNP is bound to 2
cores, which is what we asked for, but we don't know what any of the
processes themselves are bound to. So I personally cannot point to
ORTE or OGE as the culprit, because I don't think we know whether there
is an issue.
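
For what it's worth, the "0028" in Chris's output can be sanity-checked with a few lines of Python. This is only a sketch: it assumes the value is a hexadecimal bitmask in which bit N stands for core N (nothing in this thread confirms that encoding), and the function name is made up for illustration.

```python
def cores_from_mask(mask_hex):
    """Return the core indices set in a hexadecimal binding mask.

    Assumption: the reported string is a hex bitmask where bit N
    corresponds to core N. That encoding is not confirmed in the thread.
    """
    mask = int(mask_hex, 16)
    # Walk every bit position up to the highest set bit and keep the set ones.
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    print(cores_from_mask("0028"))
```

Under that assumption the mask names two cores, which would be consistent with the "-binding linear:2" request. It still says nothing about what the MPI processes themselves end up bound to.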

So, until we are able to get the --report-bindings output from the a.out
code (note I did not say orted), it is kind of hard to claim there is
even an issue. Which brings me back to the output question. After some
thinking, the --report-bindings output I am expecting is not from the
orted itself but from the a.out before it executes the user code.
That now makes me wonder if there is some odd OGE/OMPI integration
issue in which the -bind-to-core and --report-bindings options are not
being propagated/recognized/honored when qsub is given the -binding option.
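
One way to take --report-bindings out of the picture would be to have each launched process print its own affinity. The following is a minimal sketch, not something from this thread: it assumes Linux (os.sched_getaffinity is not available elsewhere) and assumes the launcher exports OMPI_COMM_WORLD_RANK to each process. The function name is hypothetical.

```python
import os
import socket


def describe_binding():
    """Build one line describing the CPUs this process may run on."""
    # Linux-only call: the set of CPU indices the calling process is allowed on.
    cpus = sorted(os.sched_getaffinity(0))
    # Assumption: the launcher sets OMPI_COMM_WORLD_RANK; fall back to "?" if not.
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    return "%s pid=%d rank=%s cpus=%s" % (
        socket.gethostname(), os.getpid(), rank, cpus)


if __name__ == "__main__":
    print(describe_binding())
```

Launching this in place of ./unterm under the same qsub and mpirun options should show, per process, whether the binding actually took effect, independent of what mpirun or the orteds report.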

> Perhaps if someone could run this test again with --report-bindings
> --leave-session-attached and provide -all- output we could verify that
> analysis and clear up the confusion?
>
Yeah, however I bet you we still won't see output.

--td
>
>
> On Wed, Nov 17, 2010 at 8:13 AM, Terry Dontje <terry.dontje_at_[hidden]
> <mailto:terry.dontje_at_[hidden]>> wrote:
>
> On 11/17/2010 10:00 AM, Ralph Castain wrote:
>> --leave-session-attached is always required if you want to see
>> output from the daemons. Otherwise, the launcher closes the ssh
>> session (or qrsh session, in this case) as part of its normal
>> operating procedure, thus terminating the stdout/err channel.
>>
>>
> I believe you, but isn't it weird that without the -binding option
> to qsub we saw --report-bindings output from the orteds?
>
> Do you have the date of the email that has the info you talked
> about below? I really am not trying to be an a-hole about this,
> but there has been so much data and email flying around that it would
> be nice to actually see the output you mention.
>
> --td
>
>
>> On Wed, Nov 17, 2010 at 7:51 AM, Terry Dontje
>> <terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>> wrote:
>>
>> On 11/17/2010 09:32 AM, Ralph Castain wrote:
>>> Chris' output is coming solely from the HNP, which is correct
>>> given the way things were executed. My comment was from
>>> another email where he did what I asked, which was to
>>> include the flags:
>>>
>>> --report-bindings --leave-session-attached
>>>
>>> so we could see the output from each orted. In that email,
>>> it was clear that while mpirun was bound to multiple cores,
>>> the orteds are being bound to a -single- core.
>>>
>>> Hence the problem.
>>>
>> Hmm, I see Ralph's comment on 11/15 but I don't see any
>> output that shows what Ralph says above. The only
>> report-bindings output I see is when he runs without OGE
>> binding. Can someone give me the date and time of Chris'
>> email with the --report-bindings and
>> --leave-session-attached options? Or a rerun of the below with the
>> --leave-session-attached option would also help.
>>
>> I find it confusing that --leave-session-attached is not
>> required when the OGE binding argument is not given.
>>
>> --td
>>
>>> HTH
>>> Ralph
>>>
>>>
>>> On Wed, Nov 17, 2010 at 6:57 AM, Terry Dontje
>>> <terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>>
>>> wrote:
>>>
>>> On 11/17/2010 07:41 AM, Chris Jewell wrote:
>>>> On 17 Nov 2010, at 11:56, Terry Dontje wrote:
>>>>>> You are absolutely correct, Terry, and the 1.4 release series does include the proper code. The point here, though, is that SGE binds the orted to a single core, even though other cores are also allocated. So the orted detects an external binding of one core, and binds all its children to that same core.
>>>>> I do not think you are right here. Chris sent the following which looks like OGE (fka SGE) actually did bind the hnp to multiple cores. However that message I believe is not coming from the processes themselves and actually is only shown by the hnp. I wonder if Chris adds a "-bind-to-core" option we'll see more output from the a.out's before they exec unterm?
>>>> As requested using
>>>>
>>>> $ qsub -pe mpi 8 -binding linear:2 myScript.com
>>>>
>>>> and
>>>>
>>>> 'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core -bind-to-core ./unterm'
>>>>
>>>> [exec5:06671] System has detected external process binding to cores 0028
>>>> [exec5:06671] ras:gridengine: JOB_ID: 59434
>>>> [exec5:06671] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
>>>> [exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
>>>> [exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
>>>> [exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
>>>> [exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
>>>> [exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
>>>> [exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
>>>>
>>>> No more info. I note that the external binding is slightly different to what I had before, but our cluster is busier today :-)
>>>>
>>> I would have expected more output.
>>>
>>> --td
>>>
>>>> Chris
>>>>
>>>>
>>>> --
>>>> Dr Chris Jewell
>>>> Department of Statistics
>>>> University of Warwick
>>>> Coventry
>>>> CV4 7AL
>>>> UK
>>>> Tel: +44 (0)24 7615 0778
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> --
>>> Oracle
>>> Terry D. Dontje | Principal Software Engineer
>>> Developer Tools Engineering | +1.781.442.2631
>>> Oracle *- Performance Technologies*
>>> 95 Network Drive, Burlington, MA 01803
>>> Email terry.dontje_at_[hidden]
>>> <mailto:terry.dontje_at_[hidden]>

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden] <mailto:terry.dontje_at_[hidden]>


