Open MPI User's Mailing List Archives

Subject: [OMPI users] SGE integration when getting slots from different queues on one and the same host mismatch
From: Reuti (reuti_at_[hidden])
Date: 2010-08-10 06:34:38


Hi,

I just stumbled upon the following behavior of Open MPI 1.4.2. The jobscript used:

***
#!/bin/sh
export PATH=~/local/openmpi-1.4.2/bin:$PATH
cat $PE_HOSTFILE
mpiexec ./dummy.sh
***

with dummy.sh:

***
#!/bin/sh
env | grep TMPDIR
sleep 30
***
===
Situation 1: getting 4 slots in total from 2 queues on 2 nodes. Output:

pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
TMPDIR=/tmp/1888.1.extra.q
TMPDIR=/tmp/1888.1.extra.q
TMPDIR=/tmp/1888.1.extra.q

The master's slot is in the first line of the PE_HOSTFILE. The job runs on pc15381 with one local fork of dummy.sh and two `qrsh -inherit` calls from pc15381 to pc15370 (checked with `ps -e f`). So only three instances are running instead of four.

===
Situation 2: getting 4 slots in total from 2 queues on one and the same node. Output:

pc15370 2 all.q@pc15370 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
TMPDIR=/tmp/1889.1.all.q
TMPDIR=/tmp/1889.1.all.q

It looks like only one PE_HOSTFILE entry is ever honored for the master node of the parallel job. So two processes are missing here.
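To compare the number of started instances against the number of granted slots, the slot counts in the second column of the PE_HOSTFILE can be added up. This is only a minimal sketch: the function name `total_slots` is made up for illustration, and it assumes the four-column layout shown in the outputs above (host, slots, queue instance, processor range).

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or Open MPI): print the total number
# of slots granted to the job by summing column 2 of a PE_HOSTFILE-style file.
total_slots() {
    # "s + 0" forces a numeric result, so an empty file prints 0.
    awk 'NF >= 2 { s += $2 } END { print s + 0 }' "$1"
}

# Possible use inside the jobscript:
#   total_slots "$PE_HOSTFILE"
```

For both situations above this should print 4, while only 3 and 2 TMPDIR lines showed up, respectively.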

===

So I see two issues:

(1) The number of started tasks is wrong. I'm not sure whether the correct behavior should be:

a) add up all slots for each machine, also for the master node of the job, and fork this number of slots

b) fork only the slots mentioned for the master queue of the job, and make a local `qrsh -inherit` for the slots running in a different queue on the same host. So the third column of the PE_HOSTFILE should be honored too.
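Option a) boils down to merging all PE_HOSTFILE entries that name the same host. The following is a rough sketch of that aggregation only, not of what Open MPI actually does internally; the function name `sum_slots_per_host` is hypothetical, and the four-column PE_HOSTFILE layout from the outputs above is assumed.

```shell
#!/bin/sh
# Hypothetical helper: add up the slots (column 2) of all entries for the
# same host (column 1), keeping the order in which hosts first appear.
# The queue instance in column 3 is deliberately ignored here; option b)
# would have to keep it.
sum_slots_per_host() {
    awk '
        NF >= 2 {
            if (!($1 in slots)) order[++n] = $1   # remember first appearance
            slots[$1] += $2
        }
        END {
            for (i = 1; i <= n; i++) print order[i], slots[order[i]]
        }
    ' "$1"
}

# Possible use inside the jobscript:
#   sum_slots_per_host "$PE_HOSTFILE"
```

For situation 2 this would give a single line `pc15370 4`, i.e. four local forks on the master node; for situation 1 it would give two slots each on pc15381 and pc15370.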

(2) In situation 1: from the example, one slot on pc15370 should run in all.q and get an appropriate $TMPDIR. This is of course a bug in SGE, which I will investigate on the SGE list.

-- Reuti