Open MPI logo

Open MPI User's Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Open MPI User's mailing list

Subject: Re: [OMPI users] MPI_Bcast issue
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-08-11 09:54:36


On Aug 10, 2010, at 10:09 PM, Randolph Pullen wrote:

> Jeff thanks for the clarification,
> What I am trying to do is run N concurrent copies of a 1 to N data movement program to affect an N to N solution. The actual mechanism I am using is to spawn N copies of mpirun from PVM across the cluster. So each 1 to N MPI application starts at the same time with a different node as root.

You mentioned that each root has a large amount of data to broadcast. How large?

Have you done back-of-the-envelope kinds of calculations to determine if you're hitting link contention kinds of limits -- e.g., would running a series of N/M broadcasts sequentially actually result in a net speedup (vs. running all N broadcasts simultaneously) because of lack of network congestion / contention?

If the messages are as large as you imply, then link contention must be taken into account of overall performance, particularly if you're using more than just a handful of nodes.

> Yes I know this is a bit odd… It was an attempt to be lazy and not re-write the code (again) and this appears to be a potential log N solution.

I'm not sure I understand that statement -- why would this be a log(n) solution if everyone is broadcasting simultaneously? (and therefore each root is assumedly using most/all available send bandwidth from its link)

> My thoughts are that the problem must be either:
>
> 1) Some bug in my code that does not occur normally (this seems unlikely because it halts in Bcast and runs in the normal 1 to N manner)
> 2) Something in MPI is fouling the bcast call
> 3) Something in PVM is fouling the bcast call
>
> Obviously, this is not the PVM forum, but have I missed anything?

A fourth possibility is that the network is dropping something that it shouldn't be (with high link contention, this is possible). You haven't mentioned, but I'm assuming that you're running over ethernet -- perhaps you're running into TCP drops and therefore (very long) TCP retransmit timeouts.

If you want to remove PVM from the equation, you could mpirun a trivial bootstrap application across all your nodes that, on each MCW rank process, calls MPI_COMM_SPAWN on MPI_COMM_SELF for the broadcast that is supposed to be rooted on that node.

> BTW: Implementing Bcast with Multicast or a combination of both multicasts and p2p transfers is another option and described by Hoefler et. al. in their paper “A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast”.

Yep; I've read it. Torsten's a smart guy. :-) I'd love to see a plugin contributed that implements this algorithm, or one of other reliable multicast algorithms.

Keep in mind that if N (where N is large) roots are all transmitting very large multicast messages simultaneously, this is a situation where networks are free to drop. In a pathological case like yours, N simultaneous multicasts may not perform as well as you would expect.

> From here I need to decide to:
>
> 1) Generate a minimal example but given that this will require PVM, it is unlikely to see much use.

I think if you can write a small MPI-only example, that would be most helpful.

> 2) Write a N to N transfer system in MPI using inter-communicators, however this may not scale with only p2p transfers and is probably N Log N at best.

Intercommunicators are a red herring here. They were mentioned earlier in the thread because people thought you were using the MPI accept/connect model of joining multiple MPI processes together. If you aren't doing that, intercomms are likely unnecessary.

> 3) Write the N to N transfer system in PVM, Open Fabric calls or something that supports broadcast/multicast calls.

I'm not sure if OpenFabrics verbs support multicast. Mellanox ConnectX cards were supposed to do this eventually, but I don't know if that capability ever was finished (Cisco left the IB business a while ago, so I've stopped paying attention to IB developments).

> My application must transfer a large (potentially huge) amount of tuples from a table distributed across the cluster to a table replicated on each node. The similar (1 to N) system compresses tuples into 64k pages and sends these. The same method would be used and the page size could be varied for efficiency.
>
> What are your thoughts? Can OpenMPI do this in under N log N time?

(Open) MPI is just a message passing library -- in terms of raw bandwidth transfer, it can pretty much do anything that your underlying network can do. Whether MPI_BCAST or MPI_ALLGATHER is the right mechanism or not is a different issue.

(I'll say that OMPI's ALLGATHER algorithm is probably not well optimized for massive data transfers like you describe)

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/