Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?
From: Rahul Nabar (rpnabar_at_[hidden])
Date: 2010-08-24 17:44:06


On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann <treumann_at_[hidden]> wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB message would trip on one.  80 tasks is a very small number in modern
> parallel computing.  Thousands of tasks involved in an MPI collective has
> become pretty standard.
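
(Side note for anyone who wants to poke at this outside of IMB: below is a minimal stand-alone sketch of the same pattern, a timed MPI_Bcast of a ~1/2 MB buffer from rank 0. The buffer size, repetition count, and file name are placeholders I picked, not anything IMB-specific.)

/* bcast_repro.c -- time repeated MPI_Bcast calls on a fixed-size buffer.
 * Build: mpicc -O2 bcast_repro.c -o bcast_repro
 * Run:   mpirun -np 80 ./bcast_repro
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 512 * 1024;   /* ~1/2 MB message, as discussed above */
    const int reps   = 100;          /* arbitrary repetition count */
    int rank, i;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(nbytes);
    if (buf == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Barrier(MPI_COMM_WORLD);     /* line everyone up before timing */
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++)
        MPI_Bcast(buf, nbytes, MPI_CHAR, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg bcast time: %.2f usec\n", 1.0e6 * (t1 - t0) / reps);

    free(buf);
    MPI_Finalize();
    return 0;
}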

Here's something absolutely strange that I accidentally stumbled upon:

I ran the test again, but forgot to kill the user jobs already running on
the test servers (via Torque and our usual queues). I was about to kick
myself, but to my surprise the test actually ran to completion! The timings
are horribly bad, but for the first time the test finished at all. How could
this be happening? It makes no sense to me that the test completes when the
cards, servers, and network are loaded but not otherwise, yet I repeated the
experiment many times and got the same result.

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast
[snip]
# Bcast
       #bytes  #repetitions   t_min[usec]   t_max[usec]   t_avg[usec]
            0          1000          0.02          0.02          0.02
            1            34     546807.94     626743.09     565196.07
            2            34      37159.11      52942.09      44910.73
            4            34      19777.97      40382.53      29656.53
            8            34      36060.21      53265.27      43909.68
           16            34      11765.59      31912.50      19611.75
           32            34      23530.79      41176.94      32532.89
           64            34      11735.91      23529.02      16552.16
          128            34      47998.44      59323.76      55164.14
          256            34      18121.96      30500.15      25528.95
          512            34      20072.76      33787.32      26786.55
         1024            34      39737.29      55589.97      45704.99
         2048             9      77787.56     150555.66     118741.83
         4096             9      44444.67     118331.78      77201.40
         8192             9      80835.66     166666.56     133781.08
        16384             9      77032.88     149890.66     119558.73
        32768             9     111819.45     177778.99     149048.91
        65536             9     159304.67     222298.99     195071.34
       131072             9     172941.13     262216.57     218351.14
       262144             9     161371.65     266703.79     223514.31
       524288             2        497.46    4402568.94    2183980.20
      1048576             2       5401.49    3519284.01    1947754.45
      2097152             2      75251.10    4137861.49    2220910.50
      4194304             2      33270.48    4601072.91    2173905.32
# All processes entering MPI_Finalize

Another observation: if I replace the openib BTL with the tcp BTL, the
tests run OK.
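
For reference, the only thing I change between the two runs is the btl MCA
parameter on the mpirun command line; BTL component names can vary between
Open MPI versions, so treat these lines as illustrative rather than exact:

mpirun -np 80 --mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast
mpirun -np 80 --mca btl tcp,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast

The first line uses the InfiniBand (openib) path; the second forces
everything over TCP.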

-- 
Rahul