Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-07-28 14:37:38

Tony --

My apologies for taking so long to answer. :-(

I was unfortunately unable to replicate your problem. I ran your source
code across 32 machines connected by TCP with no problem:

  mpirun --hostfile ~/mpi/cdc -np 32 -mca btl tcp,self netbench 8

I tried this on two different clusters with the same results -- it didn't
hang. :-(

Can you try again with a recent nightly tarball, or the 1.1.1 beta tarball
that has been posted?

On 6/30/06 8:35 AM, "Tony Ladd" <ladd_at_[hidden]> wrote:

> Jeff
> Thanks for the reply; I realize you guys must be really busy with the recent
> release of openmpi. I tried 1.1 and I don't get error messages any more. But
> the code now hangs; no error or exit. So I am not sure if this is the same
> issue or something else. I am enclosing my source code. I compiled with icc
> and linked against an icc compiled version of openmpi-1.1.
> My program is a set of network benchmarks (a crude kind of netpipe) that
> checks typical message passing patterns in my application codes.
> Typical output is:
> 32 CPU's: sync call time = 1003.0
>                     -------------- time ---------------  --------- rate (Mbytes/s) ---------  ------- bandwidth (MBits/s) -------
> loop buffers  size        XC       XE       GS       MS        XC       XE       GS       MS        XC       XE       GS       MS
>    1      64 16384  2.48e-02 1.99e-02 1.21e+00 3.88e-02  4.23e+01 5.28e+01 8.65e-01 2.70e+01  1.08e+04 1.35e+04 4.43e+02 1.38e+04
>    2      64 16384  2.17e-02 2.09e-02 1.21e+00 4.10e-02  4.82e+01 5.02e+01 8.65e-01 2.56e+01  1.23e+04 1.29e+04 4.43e+02 1.31e+04
>    3      64 16384  2.20e-02 1.99e-02 1.01e+00 3.95e-02  4.77e+01 5.27e+01 1.04e+00 2.65e+01  1.22e+04 1.35e+04 5.33e+02 1.36e+04
>    4      64 16384  2.16e-02 1.96e-02 1.25e+00 4.00e-02  4.85e+01 5.36e+01 8.37e-01 2.62e+01  1.24e+04 1.37e+04 4.28e+02 1.34e+04
>    5      64 16384  2.25e-02 2.00e-02 1.25e+00 4.07e-02  4.66e+01 5.24e+01 8.39e-01 2.57e+01  1.19e+04 1.34e+04 4.30e+02 1.32e+04
>    6      64 16384  2.19e-02 1.99e-02 1.29e+00 4.05e-02  4.79e+01 5.28e+01 8.14e-01 2.59e+01  1.23e+04 1.35e+04 4.17e+02 1.33e+04
>    7      64 16384  2.19e-02 2.06e-02 1.25e+00 4.03e-02  4.79e+01 5.09e+01 8.38e-01 2.60e+01  1.23e+04 1.30e+04 4.29e+02 1.33e+04
>    8      64 16384  2.24e-02 2.06e-02 1.25e+00 4.01e-02  4.69e+01 5.09e+01 8.39e-01 2.62e+01  1.20e+04 1.30e+04 4.30e+02 1.34e+04
>    9      64 16384  4.29e-01 2.01e-02 6.35e-01 3.98e-02  2.45e+00 5.22e+01 1.65e+00 2.64e+01  6.26e+02 1.34e+04 8.46e+02 1.35e+04
>   10      64 16384  2.16e-02 2.06e-02 8.87e-01 4.00e-02  4.85e+01 5.09e+01 1.18e+00 2.62e+01  1.24e+04 1.30e+04 6.05e+02 1.34e+04
> Time is total for all 64 buffers. Rate is one way across one link (# of
> bytes/time).
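
As a rough check of the rate definition above, the sketch below recomputes the XC and MS rates for loop 1 from the "buffers", "size" and "time" columns. It assumes that "size" is in bytes per buffer and that 1 Mbyte means 10^6 bytes; neither is stated in the message, and the function name is made up for illustration.

```python
# Recompute the "rate" column from the "time" column of the sample output.
NUM_BUFFERS = 64        # "buffers" column
BUFFER_BYTES = 16384    # "size" column; assumed to be bytes per buffer


def rate_mbytes_per_s(total_time_s, buffers=NUM_BUFFERS, size=BUFFER_BYTES):
    """One-way rate across one link: bytes moved divided by total time."""
    return buffers * size / total_time_s / 1e6  # assumes 1 Mbyte = 10**6 bytes


xc_rate = rate_mbytes_per_s(2.48e-02)  # XC time, loop 1
ms_rate = rate_mbytes_per_s(3.88e-02)  # MS time, loop 1
print(f"XC: {xc_rate:.1f} Mbytes/s, MS: {ms_rate:.1f} Mbytes/s")
```

Under those assumptions the results agree with the tabulated 4.23e+01 and 2.70e+01 to within the rounding of the printed times.
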
> 1) XC is a bidirectional ring exchange. Each processor sends to the right
> and receives from the left.
> 2) XE is an edge exchange. Pairs of nodes exchange data, with each one
> sending and receiving.
> 3) GS is MPI_Allreduce.
> 4) MS is my version of MPI_Allreduce. It splits the vector into Np blocks
> (Np is the # of processors); each processor then acts as a head node for
> one block. This uses the full bandwidth all the time, unlike MPI_Allreduce,
> which thins out as it gets to the top of the binary tree. On a 64-node
> InfiniBand system MS is about 5X faster than GS; in theory it would be 6X,
> i.e. log_2(64). Here it is 25X; I am not sure why the gap is so large. But
> MS seems to be the cause of the hangups with messages > 64K. I can run the
> other benchmarks OK, but this one seems to hang for large messages. I think
> the problem is at least partly due to the switch. All MS is doing is
> point-to-point communication, but unfortunately it sometimes requires high
> bandwidth between ASICs. It first exchanges data between near neighbors in
> MPI_COMM_WORLD, but it must progressively span wider gaps between nodes as
> it goes up the various binary trees. After a while this requires extensive
> traffic between ASICs. This seems to be a problem on both my HP 2724 and
> the Extreme Networks Summit400t-48. I am currently working with Extreme to
> try to resolve the switch issue. As I say, the code ran great on
> InfiniBand, but I think those switches have hardware flow control. Finally,
> I checked the code again under LAM and it ran OK: slow, but no hangs.
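
The block-splitting idea behind MS can be illustrated with a small single-process simulation. This is only a sketch of the general scheme described above, not the code from the attached benchmark: it models each rank as a Python list, uses no MPI calls, ignores the binary-tree message schedule entirely, and assumes the vector length divides evenly by the number of ranks. The name `block_allreduce` is hypothetical.

```python
def block_allreduce(vectors):
    """Sum equal-length vectors (one per simulated rank), block by block."""
    nprocs = len(vectors)
    n = len(vectors[0])
    assert n % nprocs == 0, "sketch assumes the vector splits evenly"
    block = n // nprocs

    # Phase 1: rank `head` acts as head node for block `head` and reduces
    # that block using the matching pieces held by every rank.
    reduced_blocks = []
    for head in range(nprocs):
        lo, hi = head * block, (head + 1) * block
        reduced_blocks.append([sum(v[i] for v in vectors) for i in range(lo, hi)])

    # Phase 2: every rank collects all reduced blocks into the full result.
    result = [x for blk in reduced_blocks for x in blk]
    return [list(result) for _ in range(nprocs)]


# Four simulated ranks; rank r holds a vector filled with r + 1.
vectors = [[rank + 1] * 8 for rank in range(4)]
out = block_allreduce(vectors)
print(out[0])
```

Because every rank is a head node for one block, all ranks stay busy in both phases, which is the property the description contrasts with a single reduction tree.
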
> To run the code, compile and type:
> mpirun -np 32 -machinefile hosts src/netbench 8
> The 8 means 2^8 Kbytes (i.e. 256K). This was enough to hang every time on
> my boxes.
> You can also edit the header file (header.h). MAX_LOOPS is how many times it
> runs each test (currently 10); NUM_BUF is the number of buffers in each test
> (must be more than the number of processors); SYNC defines the global sync
> frequency (every SYNC buffers). NUM_SYNC is the number of sequential barrier
> calls it uses to determine the mean barrier call time. You can also switch
> the various tests on and off, which can be useful for debugging.
> Tony
> -------------------------------
> Tony Ladd
> Professor, Chemical Engineering
> University of Florida
> PO Box 116005
> Gainesville, FL 32611-6005
> Tel: 352-392-6509
> FAX: 352-392-9513
> Email: tladd_at_[hidden]
> Web:

Jeff Squyres
Server Virtualization Business Unit
Cisco Systems