I just stumbled upon the following behavior of Open MPI 1.4.2. Jobscript used:
env | grep TMPDIR
Situation 1: getting 4 slots in total from 2 queues on 2 nodes. Output:
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
The slot of the master is in the first line of the PE_HOSTFILE. The job runs on pc15381 with one local fork of dummy.sh, and does two `qrsh -inherit` calls from pc15381 to pc15370 (checked with `ps -e f`). So only 3 instances are running instead of four.
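For reference, each PE_HOSTFILE line has four columns: hostname, number of slots, queue instance, and processor range. A minimal sketch of what the total task count should be, run against a hypothetical copy of the situation-1 hostfile (the file name and its contents here are assumptions mirroring the output above):

```shell
# Hypothetical PE_HOSTFILE mirroring situation 1 above
cat > pe_hostfile <<'EOF'
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
EOF

# Columns: host  slots  queue@host  processor-range
# Summing column 2 gives the number of tasks that ought to start (4 here)
total=$(awk '{ sum += $2 } END { print sum }' pe_hostfile)
echo "total slots: $total"
```

With four slots granted, anything less than four running instances of dummy.sh means an entry was dropped somewhere.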
Situation 2: getting 4 slots in total from 2 queues on one and the same node.
pc15370 2 all.q@pc15370 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
It looks like, for the master node of the parallel job, only one entry of the PE_HOSTFILE is ever honored. So 2 processes are missing here.
So I see two issues:
(1) The number of started tasks is wrong. I'm not sure what the correct behavior should be:
a) add up all slots for each machine, also for the master node of the job, and fork this number of processes; or
b) fork only the slots listed for the master queue of the job, and do a local `qrsh -inherit` for the slots running in a different queue on the same host. The third column of the PE_HOSTFILE would then have to be honored too.
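The difference between (a) and (b) comes down to whether the queue column is collapsed per host or kept as a separate launch unit. A sketch of both readings of the same hostfile (file name and contents are hypothetical, modeled on situation 1):

```shell
# Hypothetical PE_HOSTFILE with two queue entries per host (situation 1 layout)
cat > pe_hostfile <<'EOF'
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
EOF

# Reading (a): add up all slots per machine, ignoring the queue column;
# each host then gets one launch covering its combined slot count
per_host=$(awk '{ s[$1] += $2 } END { for (h in s) print h, s[h] }' pe_hostfile | sort)
echo "$per_host"

# Reading (b): keep the queue distinction; every (host, queue) pair is its
# own launch unit, so four units of one slot each remain here
per_queue=$(awk '{ print $1, $3, $2 }' pe_hostfile | sort)
echo "$per_queue"
```

Under (a) the two hosts would each fork/launch 2 processes; under (b) the slots in the second queue on the master host would additionally need a local `qrsh -inherit` so that they land in the right queue.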
(2) In situation 1: from the example, one slot on pc15370 should run in all.q and get an appropriate $TMPDIR. This is of course a bug in SGE, which I will investigate on the SGE list.