From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-08-30 09:53:30


On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:

> On 8/29/06 8:57 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>>> Does this apply to *all* tests, or only some of the tests (like
>>> allgather)?
>>
>> All of the tests: trivial and ibm. They all time out :(
>
> Blah. The trivial tests are simply "hello world", so they should take just about no time at all.
>
> Is this running under SLURM? I put the code in there to know how many procs to use in SLURM, but have not tested it in eons. I doubt that's the problem, but that's one thing to check.
>

Yep, it is running under SLURM. And the 'number of procs' code seems to be working fine, since it changes with different allocations.
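
(A quick way to sanity-check that from inside the allocation is to dump the SLURM environment; the variable names here are from our SLURM version and may differ elsewhere:

$ env | grep ^SLURM

SLURM_NNODES and SLURM_NPROCS should track the allocation size that mpirun is picking up.)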

> Can you set a super-long timeout (e.g., a few minutes), and while one of the trivial tests is running, do some ps's on the relevant nodes and see what, if anything, is running? E.g., mpirun, the test executable on the nodes, etc.

Even without setting a long timeout, it seems that mpirun is running, but nothing else, and only on the launching node.
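
(To check the remote nodes, a rough sketch of what I'm running, looping over a few of the node names from the allocation below:

$ for n in odin007 odin008 odin009; do echo "== $n =="; ssh $n ps -u mpiteam -o pid,stat,start,cmd; done

Nothing shows up on the remote nodes once the test is done.)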

When a test starts, you can see mpirun launching everything properly:
$ ps aux | grep ...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
mpiteam 28453 0.2 0.0 38072 3536 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
mpiteam 28454 0.0 0.0 41716 2040 ? Sl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam 28455 0.0 0.0 23212 1804 ? Ssl 09:50 0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam 28472 0.0 0.0 36956 2256 ? S 09:50 0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpiteam_at_[hidden]:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam 28482 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place
mpiteam 28483 0.1 0.0 64296 3564 ? S 09:50 0:00 collective/allgather_in_place

But once the test finishes, mpirun seems to just be hanging around:
$ ps aux | grep ...
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mpiteam 15083 0.0 0.0 52760 1040 ? S 09:31 0:00 /bin/bash /var/tmp/slurmd/job148126/script
root 15086 0.0 0.0 42884 3172 ? Ss 09:31 0:00 sshd: mpiteam [priv]
mpiteam 15088 0.0 0.0 43012 3252 ? S 09:31 0:00 sshd: mpiteam_at_pts/1
mpiteam 15089 0.0 0.0 56680 1912 pts/1 Ss 09:31 0:00 -tcsh
mpiteam 15117 0.5 0.8 113024 33680 ? S 09:32 0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam 15294 0.0 0.0 0 0 ? Z 09:32 0:00 [sh] <defunct>
mpiteam 28453 0.0 0.0 38204 3568 ? S 09:50 0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
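
If it would help, I can attach a debugger to the hung mpirun to see where it is stuck; roughly:

$ gdb -p 28453
(gdb) thread apply all bt

(28453 is the mpirun PID from the ps output above.)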

Thoughts?

>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems