Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
From: Rahul Nabar (rpnabar_at_[hidden])
Date: 2010-08-24 17:44:06


On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann <treumann_at_[hidden]> wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB message would trip on one.  80 tasks is a very small number in modern
> parallel computing.  Thousands of tasks involved in an MPI collective have
> become pretty standard.
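
For context, the collective under discussion is just a plain
MPI_Bcast; a minimal standalone C sketch of the 512 KB case would be
roughly the following (buffer size, datatype, and root rank are chosen
for illustration, not taken from the IMB source):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 512 KB payload -- the size range where the stalls begin */
    const int count = 524288;
    char *buf = calloc(count, 1);

    if (rank == 0)
        buf[0] = 42;   /* root supplies the data to broadcast */

    /* All ranks enter the collective; rank 0 acts as the root */
    MPI_Bcast(buf, count, MPI_CHAR, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}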

Here's something absolutely strange that I stumbled upon by accident:

I ran the test again, but forgot to kill the user jobs already
running on the test servers (via Torque and our usual queues).
I was about to kick myself, but I couldn't believe it: the test
actually ran to completion! The timings are horribly bad, but for
the first time the test completed. How could this be happening? It
makes no sense to me that the test completes when the cards, servers,
and network are loaded but not otherwise. Yet I repeated the
experiment many times, with the same result each time.

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast
[snip]
# Bcast
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.02         0.02
            1           34    546807.94    626743.09    565196.07
            2           34     37159.11     52942.09     44910.73
            4           34     19777.97     40382.53     29656.53
            8           34     36060.21     53265.27     43909.68
           16           34     11765.59     31912.50     19611.75
           32           34     23530.79     41176.94     32532.89
           64           34     11735.91     23529.02     16552.16
          128           34     47998.44     59323.76     55164.14
          256           34     18121.96     30500.15     25528.95
          512           34     20072.76     33787.32     26786.55
         1024           34     39737.29     55589.97     45704.99
         2048            9     77787.56    150555.66    118741.83
         4096            9     44444.67    118331.78     77201.40
         8192            9     80835.66    166666.56    133781.08
        16384            9     77032.88    149890.66    119558.73
        32768            9    111819.45    177778.99    149048.91
        65536            9    159304.67    222298.99    195071.34
       131072            9    172941.13    262216.57    218351.14
       262144            9    161371.65    266703.79    223514.31
       524288            2       497.46   4402568.94   2183980.20
      1048576            2      5401.49   3519284.01   1947754.45
      2097152            2     75251.10   4137861.49   2220910.50
      4194304            2     33270.48   4601072.91   2173905.32
# All processes entering MPI_Finalize

Another observation: if I replace the openib BTL with the tcp BTL,
the tests run fine.
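
For completeness, switching BTLs is just a matter of the MCA
parameter on the mpirun command line; roughly like this (the
hostfile path is a placeholder for our setup):

# InfiniBand path (stalls at the large message sizes):
mpirun --mca btl openib,sm,self -np 256 -hostfile ./hosts \
    /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast

# TCP path (runs to completion):
mpirun --mca btl tcp,sm,self -np 256 -hostfile ./hosts \
    /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast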

-- 
Rahul