No problem at all. I confess that I am lost in all the
sometimes disjointed emails in this thread. Frankly, now that I
search, I can't find it either! :-(
I see one email that clearly shows the external binding
report from mpirun, but not from any daemons. I see another
email (after you asked if there was all the output) that states
"yep", indicating that was all the output, and then proceeds to
offer additional output that wasn't in the original email you
So I am now as thoroughly confused as you are...
That said, I am confident in the code in ORTE as it has
worked correctly when I tested it against external bindings in
other environments. So I really do believe this is an OGE issue
where the orted isn't getting correctly bound against all
I am confused by your statement above because we don't even know
what is being bound or not. We know that it looks like the hnp
is bound to 2 cores, which is what we asked for, but we don't know
what any of the processes themselves are bound to. So I personally
cannot point to ORTE or OGE as the culprit because I don't think we
know whether there is an issue.
So, until we are able to get the -report-bindings output from the
a.out code (note I did not say orted) it is kind of hard to claim
there is even an issue. Which brings me back to the output
question. After some thinking the --report-bindings output I am
expecting is not from the orted itself but from the a.out before it
executes the user code. Which now makes me wonder if there is some
odd OGE/OMPI integration issue in which the -bind-to-core
-report-bindings options are not being propagated/recognized/honored
when qsub is given the -binding option.
Perhaps if someone could run this test again with
--report-bindings --leave-session-attached and provide -all-
output we could verify that analysis and clear up the confusion?
Yeah, however I bet you we still won't see output.
always required if you want to see output from the
daemons. Otherwise, the launcher closes the ssh
session (or qrsh session, in this case) as part of its
normal operating procedure, thus terminating the
I believe you, but isn't it weird that without the
-binding option to qsub we saw -report-bindings output
from the orteds?
Do you have the date of the email that has the info you
talked about below? I really am not trying to be an
a-hole about this, but there has been so much data and
email flying around that it would be nice to actually see the
output you mention.
Chris' output is
coming solely from the HNP, which is
correct given the way things were
executed. My comment was from another
email where he did what I asked, which
was to include the flags:
so we could see the output from
each orted. In that email, it was
clear that while mpirun was bound to
multiple cores, the orteds are being
bound to a -single- core.
Hence the problem.
Hmm, I see Ralph's comment on 11/15 but I
don't see any output that shows what Ralph
says above. The only report-bindings
output I see is when he runs without OGE
binding. Can someone give me the date and
time of Chris' email with the
--leave-session-attached option? A rerun of the
below with the --leave-session-attached
option would also help.
I find it confusing that
--leave-session-attached is not required
when the OGE binding argument is not given.
You are absolutely correct, Terry, and the 1.4 release series does include the proper code. The point here, though, is that SGE binds the orted to a single core, even though other cores are also allocated. So the orted detects an external binding of one core, and binds all its children to that same core.
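To make the suspected failure mode concrete, here is a toy model of it. This is only a sketch and not ORTE's actual code: the function name bind_children and the core numbers are made up for illustration, and it assumes children are simply handed cores in order from whatever set the daemon itself was bound to.

```python
def bind_children(allowed_cores, num_children):
    """Toy model (hypothetical, not ORTE code): give each child one core,
    cycling through the cores the parent daemon is allowed to use."""
    cores = sorted(allowed_cores)
    if not cores:
        raise ValueError("daemon has no cores to hand out")
    return [cores[i % len(cores)] for i in range(num_children)]


# Daemon inherited a two-core external binding: children are spread out.
print(bind_children({3, 5}, 2))

# Daemon inherited a one-core external binding: every child lands on it.
print(bind_children({3}, 2))
```

Under that assumption, a daemon that sees only one core in its external binding has nothing else to hand out, so every child ends up sharing that core no matter how many slots were allocated on the node.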
I do not think you are right here. Chris sent the following, which makes it look like OGE (fka SGE) actually did bind the hnp to multiple cores. However, I believe that message is not coming from the processes themselves and is actually only shown by the hnp. I wonder whether, if Chris adds a "-bind-to-core" option, we'll see more output from the a.out's before they exec unterm?
As requested using
$ qsub -pe mpi 8 -binding linear:2 myScript.com
'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core -bind-to-core ./unterm'
[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
No more info. I note that the external binding is slightly different to what I had before, but our cluster is busier today :-)
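For anyone trying to read the output above, a small sketch that pulls out the two numbers that matter. It assumes the "0028" in the external binding line is a hexadecimal bitmask in which bit i stands for core i; nothing in this thread confirms that format, so treat the decoded core list as a guess. The helper names decode_mask and total_slots are made up for this example.

```python
import re

# Lines copied from the output above (JOB_ID and PE_HOSTFILE path lines omitted).
LOG = """\
[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
"""


def decode_mask(hex_mask):
    """Return the set bit positions of a hex string, lowest bit first.
    ASSUMPTION: the value is a hex bitmask with bit i meaning core i."""
    value = int(hex_mask, 16)
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]


def total_slots(log):
    """Sum every 'slots=N' value reported from the PE_HOSTFILE."""
    return sum(int(n) for n in re.findall(r"PE_HOSTFILE shows slots=(\d+)", log))


mask = re.search(r"external process binding to cores ([0-9a-fA-F]+)", LOG).group(1)
print("cores in mask:", decode_mask(mask))
print("total slots:", total_slots(LOG))
```

Read that way, the mask has two bits set, which would match the two cores requested with -binding linear:2, and the slots add up to the 8 requested with -pe mpi 8. Note this still only describes the process that printed the line (pid 06671 on exec5), not the orteds or the a.out processes.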