--leave-session-attached is always required if you
want to see output from the daemons. Otherwise, the launcher
closes the ssh session (or qrsh session, in this case) as part of
its normal operating procedure, thus terminating the stdout/err
I believe you but isn't it weird that without the --binding option
to qsub we saw -report-bindings output from the orteds?
Do you have the date of the email that has the info you talked about
below? I really am not trying to be an a-hole about this, but there
has been so much data and email flying around that it would be nice to
actually see the output you mention.
Chris' output is coming solely
from the HNP, which is correct given the way things
were executed. My comment was from another email where
he did what I asked, which was to include the flags:
so we could see the output from each orted. In
that email, it was clear that while mpirun was bound
to multiple cores, the orteds are being bound to a
Hence the problem.
Hmm, I see Ralph's comment on 11/15 but I don't see any
output that shows what Ralph says above. The only
report-bindings output I see is when he runs without OGE
binding. Can someone give me the date and time of Chris'
email with the --report-bindings and
--leave-session-attached? Or a rerun of the below with
the --leave-session-attached option would also help.
I find it confusing that --leave-session-attached is not
required when the OGE binding argument is not given.
You are absolutely correct, Terry, and the 1.4 release series does include the proper code. The point here, though, is that SGE binds the orted to a single core, even though other cores are also allocated. So the orted detects an external binding of one core, and binds all its children to that same core.
I do not think you are right here. Chris sent the following, which looks like OGE (fka SGE) actually did bind the HNP to multiple cores. However, I believe that message is not coming from the processes themselves and is actually only shown by the HNP. I wonder whether, if Chris adds a "-bind-to-core" option, we'll see more output from the a.out's before they exec unterm?
As requested using
$ qsub -pe mpi 8 -binding linear:2 myScript.com
'mpirun -mca ras_gridengine_verbose 100 --report-bindings -by-core -bind-to-core ./unterm'
[exec5:06671] System has detected external process binding to cores 0028
[exec5:06671] ras:gridengine: JOB_ID: 59434
[exec5:06671] ras:gridengine: PE_HOSTFILE: /usr/sge/default/spool/exec5/active_jobs/59434.1/pe_hostfile
[exec5:06671] ras:gridengine: exec5.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec1.cluster.stats.local: PE_HOSTFILE shows slots=2
[exec5:06671] ras:gridengine: exec4.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec3.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec2.cluster.stats.local: PE_HOSTFILE shows slots=1
[exec5:06671] ras:gridengine: exec7.cluster.stats.local: PE_HOSTFILE shows slots=1
No more info. I note that the external binding is slightly different to what I had before, but our cluster is busier today :-)
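
For anyone reading the "external process binding to cores 0028" line above: a minimal sketch of how that value could be decoded, assuming it is a hexadecimal bitmask in which bit i stands for logical core i. That reading is an assumption about the log format, not something confirmed in this thread, and the helper name cores_from_mask is made up for illustration.

```python
def cores_from_mask(mask_hex):
    """Return the indices of the bits set in a hexadecimal mask string.

    Assumption: bit i of the mask corresponds to logical core i.
    """
    mask = int(mask_hex, 16)
    # Walk every bit position up to the highest set bit and keep the set ones.
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# "0028" is the value reported by the HNP in the log above.
print(cores_from_mask("0028"))  # prints [3, 5]
```

Under that assumption the HNP would be bound to two cores, which would be consistent with the "-binding linear:2" request on the qsub line.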