From: Josh Hursey (jjhursey_at_[hidden])
Date: 2006-08-29 13:55:58


Hey all,

I'm having trouble getting tests to complete in MTT without timing
out. The tests hang and time out under MTT, but complete normally
outside of MTT.

Here are some details:
Build:
   Open MPI Trunk (1.3a1r11481)

Tests:
   Trivial
   ibm

BTL:
   tcp
   self

Nodes/processes:
   16 nodes (32 processors) on the Odin Cluster at IU

In MTT, all of the tests time out:
<mtt snip>
Running command: mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11481/install collective/allgather
Timeout: 1 - 1156872348 (vs. now: 1156872028)
Past timeout! 1156872348 < 1156872349
Past timeout! 1156872348 < 1156872349
Command complete, exit status: 72057594037927935
Evaluating: &or(&eq(&test_exit_status(), 0), &eq(&test_exit_status(), 77))
Got name: test_exit_status
Got args:
_do: $ret = MTT::Values::Functions::test_exit_status()
&test_exit_status returning: 72057594037927935
String now: &or(&eq(72057594037927935, 0), &eq(&test_exit_status(), 77))
Got name: eq
Got args: 72057594037927935, 0
_do: $ret = MTT::Values::Functions::eq(72057594037927935, 0)
&eq got: 72057594037927935 0
&eq: returning 0
String now: &or(0, &eq(&test_exit_status(), 77))
Got name: test_exit_status
Got args:
_do: $ret = MTT::Values::Functions::test_exit_status()
&test_exit_status returning: 72057594037927935
String now: &or(0, &eq(72057594037927935, 77))
Got name: eq
Got args: 72057594037927935, 77
_do: $ret = MTT::Values::Functions::eq(72057594037927935, 77)
&eq got: 72057594037927935 77
&eq: returning 0
String now: &or(0, 0)
Got name: or
Got args: 0, 0
_do: $ret = MTT::Values::Functions::or(0, 0)
&or got: 0 0
&or: returning 0
String now: 0
*** WARNING: Test: allgather, np=32, variant=1: TIMED OUT (failed)
</mtt snip>
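
As a sanity check on the numbers in that log, here is a small Python sketch. It only uses values copied from the snippet. The 64-bit shift is an assumption on my part: I have not confirmed how MTT derives the exit status, but the reported value happens to equal an all-ones 64-bit word shifted right by 8 bits, which is what a wait status of -1 would look like after extracting the exit code. The `passes` helper is a hypothetical stand-in for the logged `&or(&eq(...))` expression, not MTT code.

```python
# Values copied verbatim from the MTT log snippet.
deadline = 1156872348          # "Timeout: 1 - 1156872348"
started = 1156872028           # "(vs. now: 1156872028)"
reported_status = 72057594037927935

# Seconds between "now" and the deadline, i.e. the timeout window in effect.
timeout_window = deadline - started

# ASSUMPTION: the status comes from shifting a 64-bit wait status right by 8.
# A wait status of -1, viewed as an unsigned 64-bit word, is all ones.
minus_one_u64 = (1 << 64) - 1
shifted = minus_one_u64 >> 8

def passes(status):
    """Hypothetical mirror of: &or(&eq(status, 0), &eq(status, 77))."""
    return status == 0 or status == 77

print("timeout window (s):", timeout_window)
print("status matches -1 >> 8:", shifted == reported_status)
print("test counted as pass:", passes(reported_status))
```

If the assumption holds, the odd exit status would mean the command never returned a real exit code, which would be consistent with it being killed at the timeout rather than failing on its own.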

Outside of MTT using the same build the test runs and completes
normally:
  $ cd ~/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11481/tests/ibm/ibm/
  $ mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11481/install collective/allgather
  $

Any thoughts on why this might be happening in MTT but not outside of
it?

Cheers,
Josh