
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI_Bcast issue
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-08-11 10:12:36


On Aug 11, 2010, at 12:10 AM, Randolph Pullen wrote:

> Sure, but broadcasts are faster - less reliable apparently, but much faster for large clusters.

Just to be totally clear: MPI_BCAST is defined to be "reliable", in the sense that it will complete or invoke an error (vs. unreliable data streams like UDP, where a sent packet may or may not arrive at the receiver).

I think you're saying that something in your setup does not appear to be functioning properly -- possibly an OMPI bug, possibly TCP timeouts, possibly incorrect use of MPI, possibly ...etc. But I just wanted to disambiguate the meaning of the word "reliable" here.

> Jeff says that all OpenMPI calls are implemented with point to point B-tree style communications of log N transmissions

Just to clarify so that I'm not mis-quoted, I said: "All of Open MPI's network-based collectives use point-to-point communications underneath (shared memory may not, but that's not the issue here)".

1. "Collectives" means a very different thing than "all Open MPI calls".
2. Some of our algorithms are not based on binary (or binomial -- it's not clear what you meant) trees.

Sorry to be so pedantic -- but mis-quotes like this have been the source of huge misunderstandings in the past.
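
To make the tree terminology above concrete, here is a rough sketch of how a binomial-tree broadcast built from point-to-point sends reaches N ranks in about log2(N) rounds. It is a plain C simulation, not Open MPI code: the function name binomial_bcast_rounds, the MAX_RANKS limit, and the choice of rank 0 as root are all assumptions made for illustration, and nothing here claims to match the algorithm Open MPI actually selects.

```c
#include <assert.h>

#define MAX_RANKS 1024   /* illustrative limit, not an MPI constant */

/* Hypothetical helper: simulate a binomial-tree broadcast rooted at rank 0
 * among n ranks. Returns the number of rounds and stores the number of
 * point-to-point messages in *sends. */
int binomial_bcast_rounds(int n, int *sends)
{
    int has_data[MAX_RANKS] = {0};
    int rounds = 0;

    assert(n >= 1 && n <= MAX_RANKS);
    *sends = 0;
    has_data[0] = 1;                 /* the root starts with the data */

    /* In the round with distance `mask`, every rank r < mask already holds
     * the data and forwards it to rank r + mask, if that rank exists. */
    for (int mask = 1; mask < n; mask <<= 1) {
        for (int r = 0; r < mask && r + mask < n; r++) {
            assert(has_data[r] && !has_data[r + mask]);
            has_data[r + mask] = 1;
            (*sends)++;
        }
        rounds++;
    }

    for (int r = 0; r < n; r++)
        assert(has_data[r]);         /* every rank was reached */
    return rounds;
}
```

Calling binomial_bcast_rounds(8, &sends) simulates 3 rounds and 7 messages: the message count is always N-1, but the number of sequential rounds grows only logarithmically, which is the property people usually mean by "log N" here.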

It is also worth noting that Open MPI's collectives are implemented with plugins -- there's nothing preventing a new plugin that does *not* use point-to-point communication calls (like the shared memory collective implementations, or multicast, or some other kind of hardware collective offload, or ...).
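
The plugin idea can be sketched roughly as a table of candidate implementations, from which one is picked at run time. The code below is purely illustrative: the coll_plugin struct, its priority field, the availability callbacks, and select_coll_plugin are hypothetical names, and picking by highest priority is an assumption made for the example. It is NOT Open MPI's real component interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plugin descriptor; not Open MPI's actual interface. */
typedef struct {
    const char *name;          /* e.g. "p2p", "shm", "offload" (made up) */
    int priority;              /* assumption: higher priority wins */
    int (*available)(void);    /* nonzero if usable in this job */
} coll_plugin;

/* Trivial availability probes used only for illustration. */
int always_available(void) { return 1; }
int never_available(void)  { return 0; }

/* Return the available plugin with the highest priority, or NULL if none
 * of the candidates can be used. */
const coll_plugin *select_coll_plugin(const coll_plugin *plugins, size_t count)
{
    const coll_plugin *best = NULL;

    assert(plugins != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!plugins[i].available())
            continue;                       /* e.g. hardware not present */
        if (best == NULL || plugins[i].priority > best->priority)
            best = &plugins[i];
    }
    return best;
}
```

The point of the design is that the caller of a collective never needs to know which implementation was chosen: a point-to-point plugin and a hardware-offload plugin can sit side by side in the same table.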

Indeed, I should point out that my statement was not entirely correct because Voltaire just recently committed the "fca" plugin to the OMPI development trunk (to be introduced in OMPI v1.5) that uses IB hardware offloading for MPI collective implementations -- see their press releases and marketing material for how this stuff works. Mellanox has slightly different MPI collective IB hardware offloading technology for Open MPI, too.

> So I guess that altoall would be N log N

I'm not sure of the complexity of OMPI's alltoall algorithms offhand. I see at least 3 algorithms after a *quick* look in the OMPI source code. They probably all have their own complexities, but need to be viewed in the context of when those algorithms allow themselves to be used (e.g., O(N) may not matter if there's a small number of peers with small messages).
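
As one data point on alltoall cost, here is a rough simulation of a textbook pairwise-exchange schedule, in which rank r sends to rank (r + s) mod N at step s. Whether this corresponds to any of the algorithms in the OMPI source is not claimed here; the function name pairwise_alltoall_steps and the MAX_PEERS limit are assumptions made for the example.

```c
#include <assert.h>

#define MAX_PEERS 64   /* illustrative limit, not an MPI constant */

/* Hypothetical helper: simulate a pairwise-exchange alltoall among n ranks.
 * Returns the number of sequential steps and stores the total number of
 * point-to-point messages in *msgs. */
int pairwise_alltoall_steps(int n, long *msgs)
{
    int delivered[MAX_PEERS][MAX_PEERS] = {{0}};
    int steps = 0;

    assert(n >= 1 && n <= MAX_PEERS);
    *msgs = 0;

    for (int r = 0; r < n; r++)
        delivered[r][r] = 1;         /* each rank keeps its own block locally */

    /* At step s, every rank r sends its block for rank (r + s) % n. */
    for (int s = 1; s < n; s++) {
        for (int r = 0; r < n; r++) {
            int dst = (r + s) % n;
            assert(!delivered[r][dst]);
            delivered[r][dst] = 1;
            (*msgs)++;
        }
        steps++;
    }

    for (int r = 0; r < n; r++)
        for (int d = 0; d < n; d++)
            assert(delivered[r][d]); /* every block reached its destination */
    return steps;
}
```

For this particular schedule, pairwise_alltoall_steps(4, &msgs) gives 3 steps and 12 messages: N-1 steps per rank and N*(N-1) messages overall. With a handful of peers and small messages that linear step count is cheap, which is why an algorithm's complexity has to be read together with the conditions under which it is selected.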

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/