On Mar 22, 2011, at 14:20, Ralph Castain wrote:
> Hi folks
> For those interested in trying it, I completed backporting the multicast grpcomm module from my branch over the last weekend. This allows all modex and other ORTE-level collective operations to occur via multicast, which significantly improves the performance of those operations.
Looks promising. Based on my understanding of the multicast protocols and their implementations, I wonder how you overcome some of the limitations of UDP multicast.
As IP multicast is a one-to-many protocol, only broadcast-type collectives can be expressed efficiently. So this only covers half the modex operations and half the initial application spawn (not the daemon URI collection). However, this is still better than nothing!
Unfortunately, multicast over UDP inherits one of UDP's major features: its unreliability. While packet drops can hardly be triggered on a single-switch configuration, this is not a reliable approach. I noticed you implemented a fixed-size window (based on a circular ring) to increase the reliability of the UDP rmcast. However, it is not yet clear what will happen when thousands of modex messages collide. Apparently, if the lost message is not found in the buffer, no drastic action is taken (i.e., the job will just hang). Thus, without a __reliability__ layer built in, this is not a practice we should encourage in production-quality software.
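To make the concern concrete, here is a minimal sketch of a fixed-size retransmission ring. It is not the rmcast code: the class name, the window size, and the message contents are all hypothetical and chosen only for illustration. It shows that once the sender has moved more than one window past a message, a late retransmit request for it cannot be served.

```python
from collections import namedtuple

# Hypothetical record for one sent message; not taken from the ORTE sources.
Message = namedtuple("Message", ["seq", "payload"])


class RetransmitRing:
    """Fixed-size circular buffer of recently sent messages (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size

    def record(self, seq, payload):
        # The slot index wraps around, so message `seq` overwrites
        # message `seq - size`.
        self.slots[seq % self.size] = Message(seq, payload)

    def lookup(self, seq):
        # Return the payload if it is still buffered, otherwise None.
        msg = self.slots[seq % self.size]
        if msg is not None and msg.seq == seq:
            return msg.payload
        return None


ring = RetransmitRing(size=4)
for seq in range(10):
    ring.record(seq, f"modex-{seq}".encode())

recent = ring.lookup(8)   # still inside the window
evicted = ring.lookup(2)  # slot was reused by seq 6; nothing left to resend
```

In this sketch `lookup` returning None is the case I am worried about: unless the caller turns it into an explicit error or an abort, the receiver keeps waiting for a message that can no longer be resent.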
If we assume the context of a LAN, then there are 3 categories: hub-only LAN, switch without IGMP, and switch with IGMP control. The first two are similar: the broadcast goes over all output links (it is a flooding protocol; the message is dropped at the kernel level if no application is waiting for it). For the last class, the output only goes to the segments where hosts have requested it. Therefore, to make sure nobody misses a single multicast, one has to verify that all processes supposed to take part in the bcast are ready to receive. While this doesn't sound like a big issue, it implies a many-to-one type of operation in the context of ORTE.
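For reference, this is roughly what "requesting" a group looks like on the receiver side, sketched with the standard sockets API. The helper names are hypothetical, and the group address and port in the usage note are arbitrary values, not the ones ORTE uses. The join is a purely local call: the sender gets no notification when it completes, which is why some form of readiness collection (the many-to-one step) is needed before the first bcast.

```python
import socket


def membership_request(group, interface="0.0.0.0"):
    # Byte layout of struct ip_mreq: the group address followed by the
    # local interface address (0.0.0.0 lets the kernel pick one).
    return socket.inet_aton(group) + socket.inet_aton(interface)


def open_receiver(group, port):
    # Hypothetical helper: bind a UDP socket and join the multicast group.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining makes the host announce its membership via IGMP, so an
    # IGMP-aware switch starts forwarding the group to this segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock
```

A receiver would call something like `open_receiver("239.1.2.3", 5000)` and only then report to the sender that it is ready. Any multicast sent before the join has taken effect on the switch is simply never delivered to that host.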
The last issue is about port/address allocation. It appears that the current implementation relies on MCA parameters (base_multicast_ports) to ensure uniqueness of the port/multicast address allocation. Therefore, when two mpiruns run simultaneously on different machines of the same cluster, the user (or the users) will have to ensure mutual exclusion of the ports.
> In order to use it, you'll need to add --enable-multicast to your configure, and -mca grpcomm mcast to your cmd line. You'll also need a reasonably good udp multicast environment. The new module will work with any launch environment.
> I'm not really focused on scalability in my branch (mostly on resilience), but I did some quick experiments and found that the new module reduced modex time by quite a bit, depending on system and scale of course.
> I hope to finish my backport over the next week or so - the last part will enable ALL orte system operations to be done via multicast. This eliminates things like the initial TCP connection flood back to the HNP when the daemons are launched. Again, I don't focus much on scalability, so anyone wanting to test that capability at scale will be welcome. I'll send out another note when it is ready.
"To preserve the freedom of the human mind then and freedom of the press, every spirit should be ready to devote itself to martyrdom; for as long as we may think as we will, and speak as we think, the condition of man will proceed in improvement."
-- Thomas Jefferson, 1799