This sounds like the fuel problem we're facing right now. Potentially,
there are enough resources (for now). Simultaneously, there is enough
demand (forever). But they are connected by this artificially
maintained tiny pipe ...
The tuned collectives are not supposed to adapt to all cases. They are
supposed to deliver the best performance available when each process
has its own dedicated network resources. In other words, when there
is one process per node. Why does CT6 deliver better performance? Process
placement and the communication pattern are just a few of the factors
that affect it. Change one of them and, for a specific
configuration, you will get a [possibly] large improvement in
performance. However, it's a temporary benefit, because it doesn't
solve the real issue, it just hides it.
Until we have the hierarch collective working, there is no miracle
solution to this problem. Well ... except not starting 16 processes
per node :)
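
To make the 16-processes-per-node point concrete, here is a rough
Python sketch (not Open MPI code) that counts how many messages a
single node has to push off-node in each round of a barrier. It rests
on assumptions that are not stated in this thread: a power-of-two
process count, ranks packed onto nodes in consecutive blocks, and the
textbook recursive doubling pairing (partner = rank XOR 2^k). The
function name is made up for illustration.

```python
def off_node_sends_per_round(nprocs, ppn):
    """For each round of a textbook recursive-doubling barrier, return the
    largest number of messages any single node sends off-node.

    Assumptions (illustrative only): nprocs is a power of two, and ranks
    are placed on nodes in consecutive blocks of `ppn` (node = rank // ppn).
    """
    assert nprocs > 0 and nprocs & (nprocs - 1) == 0, "power of two only"
    rounds = nprocs.bit_length() - 1  # log2(nprocs)
    result = []
    for k in range(rounds):
        per_node = {}
        for rank in range(nprocs):
            partner = rank ^ (1 << k)  # exchange partner in round k
            if rank // ppn != partner // ppn:
                # This exchange leaves the node and goes through its NIC.
                node = rank // ppn
                per_node[node] = per_node.get(node, 0) + 1
        result.append(max(per_node.values(), default=0))
    return result


if __name__ == "__main__":
    print(off_node_sends_per_round(64, 16))  # 16 processes per node
    print(off_node_sends_per_round(64, 1))   # one process per node
```

Under these assumptions, 64 processes at 16 per node give
[0, 0, 0, 0, 16, 16]: the first four rounds stay inside the node, and
in the last two all 16 processes on a node hit the network at the same
time. With one process per node every round is [1, 1, 1, 1, 1, 1],
which is the case the tuned collectives are designed for.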
On Apr 15, 2008, at 1:45 PM, Rolf Vandevaart wrote:
> I have been running the IMB performance tests and noticed some
> strange behavior. This is running on a CentOS cluster with 16
> processes per node and using the openib btl. Currently, I am
> looking at the MPI_Barrier performance. Since we make use of a
> recursive doubling algorithm (in the tuned collective) I would have
> expected to see a log2(np) type performance. However, the data is
> much worse than log2(np) with the trunk being worse than v1.2.4.
> One interesting piece of data is that I replaced the tuned algorithm
> with one that is very similar (copied from Sun Clustertools 6), but
> instead of each process doing send/recv, we have each one do a send
> to their lower partners followed by a receive with their upper
> partners. Then, everything is reversed, which finishes the
> barrier. For reasons unknown, this appears to perform better even
> though both algorithms should be log2(np).
> Another interesting fact is that when run on my really slow cluster
> over TCP (latency of 150 usec) the tuned barrier algorithm very
> closely follows the expected log2(np).
> I have mentioned this issue to a few people, but thought I would
> share with a wider audience to see if anyone else has observed
> MPI_Barrier that is not log2(np). I have attached two pdfs. The
> first one shows my results and the second one is a picture of the
> two different barrier algorithms.