Sorry for the late replies but with work, time zones etc… This post has been going on for a while and in an attempt bring it to a close I’m going to try to collapse this down to some core issues and answer all the questions in 1 place. Richard: yes your last statement is correct, I am just using PVm solely as a launcher, the MPI worlds are semantically independent. Jeffs suggestion that it may be a network congestion issue rings a bell somewhere. Jeff, although it is possible to make a small example program, this would require PVM or some other method of launching MPI simultaneously on each node. I would agree that this is a bit off topic for this forum and so I won’t mention it further. In finalizing this issue, I would like to discuss the characteristics of the other options available. If I understood what to expect from the alltoall on a large cluster and given the scenario outlined below it may help me greatly in deciding how (or if) to rewrite this. BTW: Jeff, sorry if I miss quoted you, I must have missunderstood. From your post reproduced in part here: ============================================================== > All of Open MPI's networkbased collectives use pointtopoint >communications underneath (shared memory may not, but that's not the issue >here). One of the implementations is linear, meaning that the root sends the >message to comm rank 1, then comm rank 2, ..etc. But this algorithm is only >used when the message is small, the number of peers is small, etc. All the other >algorithms are parallel in nature, meaning that after an iteration or two, >multiple processes have the data and can start pipelining sends to other >processes, etc. ============================================================== What I meant when I said btree is nearly right I think – I should have said ‘in an NTree manner’ but both would produce O(log N) solutions and I agree that these are all perfectly fine for almost everything. This assumes that you have ‘adequate’ network bandwidth as you correctly pointed out in your recent post. This may not be the case for my problem (see below) The Problem:  A large all to all (N to N transmission or N broadcasts) of possibly hundreds of GB in total.  The cluster size my clients will use is unknown at this time but probably in the order of between 10 to a few hundred nodes.  The number of nodes is likely to increase with the data size but the ratio of data/node is unknown and variable. My design Goals: 1. Speed and accuracy are everything. Accuracy is paramount but the system would become unusable if this algorithm became exponential. 2. I love the flexibility OMPI brings to fabric deployment. I want to pass on the richness of these choices to my clients/customers – however if IB (or some other) plugin solution moved the alltoall algorithm from say O(N log N) to just O(Log N) transmission, its mandatory use may be an acceptable solution on larger clusters My Assumptions 1. I can concentrate on providing the best near linear solutions and ignore site implementation peculiarities 2. Tuning each installation can accommodate all site specific idiosyncrasies 3. The solution will probably be network bound. No mater how fast the network is, 100GB may well be too much for concurrent p2p transmissions to run in O(log N) time [please feel free to trash my assumptions] This is a difficult problem, I have written 3 solutions for this using different technologies and I have been unsatisfied with each so far. Theoretically the problem can be solved in N broadcasts but [Jeff] as you point out, in practice, data loss is likely on the nodes who are not ready etc.. However a near O(N) solution should be possible. It appears that OMPI’s Bcast is O(log N) for N > a trivial number of nodes So AlltoAll is probably at least O(N log N) – unless it utilises something other than p2p transmissions and its only O(N log N) if there is adequate bandwidth on the network fabric. Do I have it correct? Is alltoall going to work for me ? Randolph  On Fri, 13/8/10, Jeff Squyres <jsquyres@cisco.com> wrote:
