George Bosilca wrote:
> Something like this. We can play with the eager size too, maybe 4K is
> too small.
I am curious why the larger buffer sizes work better. We ran into a
similar issue on one of our platforms, and it turned out that the
non-temporal copy path was not triggered until the memcpy was large
(64K).
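
For what it's worth, here is a rough sketch of the kind of size cutoff
I mean. This is not the sm BTL's actual copy path; the COPY_NT_THRESHOLD
name and the use of SSE2 intrinsics are just my own illustration of why
bandwidth would only change once a fragment crosses the cutoff:

#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */

/* Hypothetical cutoff: below this a cached memcpy usually wins; above it,
 * non-temporal stores avoid pulling the destination into the cache. */
#define COPY_NT_THRESHOLD (64 * 1024)

static void copy_with_nt_hint(void *dst, const void *src, size_t len)
{
    /* Small copies, or a destination we cannot stream to, take the
     * ordinary (cache-allocating) memcpy path. */
    if (len < COPY_NT_THRESHOLD || ((uintptr_t)dst & 15) != 0) {
        memcpy(dst, src, len);
        return;
    }

    const char *s = (const char *)src;
    char *d = (char *)dst;
    size_t n = len & ~(size_t)15;   /* bulk portion, 16 bytes at a time */
    size_t i;

    for (i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_stream_si128((__m128i *)(d + i), v);  /* non-temporal store */
    }
    _mm_sfence();                   /* make the streamed stores visible */

    if (len > n) {                  /* trailing bytes */
        memcpy(d + n, s + n, len - n);
    }
}

If the memcpy we end up in (libc or otherwise) has a similar cutoff, that
would explain why only the larger fragment sizes see the streaming-store
bandwidth.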
> On Mar 18, 2009, at 06:43 , Terry Dontje wrote:
>> George Bosilca wrote:
>>> The default values for the large-message fragments are not optimized
>>> for the new generation of processors. This might be worth
>>> investigating, to see whether we can match their bandwidth.
>> Are you suggesting bumping up the btl_sm_max_send_size value from 32K
>> to something greater?
>>> On Mar 17, 2009, at 18:23 , Eugene Loh wrote:
>>>> A colleague of mine ran some microkernels on an 8-way Barcelona box
>>>> (Sun x2200M2 at 2.3 GHz). Here are some performance comparisons
>>>> with Scali. The performance tests are modified versions of the
>>>> HPCC pingpong tests. The OMPI version is the trunk with my
>>>> "single-queue" fixes... otherwise, OMPI latency at higher np would
>>>> be noticeably worse.
>>>>          latency (ns)      bandwidth (MB/s)
>>>>          (8-byte msgs)     (2M-byte msgs)
>>>>          ==============    ================
>>>>   np     Scali    OMPI     Scali    OMPI
>>>>    2      327      661      1458    1295
>>>>    4      369      670      1517    1287
>>>>    8      414      758      1535    1294
>>>> OMPI latency is nearly 2x Scali's. Presumably, "fastpath" PML
>>>> latency optimizations would help us a lot here. Thankfully, our
>>>> latency stays flat with np thanks to the recent "single-queue"
>>>> fixes... otherwise our high-np latency story would be so much
>>>> worse. We're behind on bandwidth as well, though not as pitifully
>>>> so.