> No, I assumed it based on comparisons between doing and not doing small
> msg rdma at various scales, from a paper Galen pointed out to me.
Actually, I wasn't so much concerned with how you jumped to your conclusion.
I just wanted to point out that you did. Most people who focus on ping-pong
latency like you have don't realize that they're jumping to a conclusion.
You suggested that optimizing for a latency micro-benchmark would benefit
small clusters, and that's just not (uniformly) true.
> Benchmarks are what they are. In the above paper, the tests place the
> cross-over at around 64 nodes and that confirms a number of anecdotal
> reports I got. It may well be that in some situations, small-msg rdma is
> better only for 2 nodes, but that's not such a likely scenario; reality
> is sometimes linear (at least at our scale :-) ) after all.
Well, if you didn't like me pointing out that jump, then I'll try a different
one. It's fairly straightforward to correlate the latency performance of
the micro-benchmark directly to RDMA versus send/recv. You can't really
do the same for the NPB results, since things like collective communication
performance can play a big part. So the assumption that RDMA is the reason
MVAPICH wins where it does may not hold.
I apologize if it seems like I'm picking on you. I'm hypersensitive to
people trying to make judgements based on micro-benchmark performance.
I've been trying to make an argument that two-node ping-pong latency
comparisons really only have meaning in the context of a whole system.
The answer to the question of why the latency performance of my 10,000-node
machine is worse than someone else's 128-node cluster has a lot to do with
meeting the scaling requirements of a 10,000-node machine. (To some extent
it has to do with the vendor as well, but that's a different story...)