Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] OMPI vs Scali performance comparisons
From: Tim Mattox (timattox_at_[hidden])
Date: 2009-03-18 18:19:16

That might indicate the source of the bandwidth difference.
Open MPI uses the compiler-supplied memcpy, which may or
may not be particularly fast for a given machine/architecture.
Scali could very well be using its own tuned memcpy.

On the hulk and tank systems at IU (16-core Intel shared-memory machines),
I saw a factor of 2 difference in memcpy performance between glibc
and a simple x86 asm routine. The asm routine was twice as fast in
some cases, particularly when the data was larger than the L2 cache.

On Wed, Mar 18, 2009 at 5:12 PM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:
> I don't have access to the machine where my colleague ran.  On other
> machines, it appears that playing with eager or fragsize doesn't change
> much... and, in any case, OMPI bandwidth is up around memcpy bandwidth.  So,
> maybe the first challenge is reproducing what he saw and/or getting access
> to his system.
> Terry Dontje wrote:
>> George Bosilca wrote:
>>> Something like this. We can play with the eager size too, maybe 4K is too
>>> small.
>> I guess I am curious why the larger buffer sizes work better?  I am
>> curious because we ran into a similar issue on one of our platforms and it
>> turned out to be the non-temporal copy was not initiated until a large (64K)
>> memcpy.
>>> On Mar 18, 2009, at 06:43 , Terry Dontje wrote:
>>>> George Bosilca wrote:
>>>>> The default values for the large message fragments are not optimized
>>>>> for the new generation processors. This might be something to investigate,
>>>>> in order to see if we can have the same bandwidth as they do or not.
>>>> Are you suggesting bumping up the btl_sm_max_send_size value from 32K to
>>>> something greater?
>>>>> On Mar 17, 2009, at 18:23 , Eugene Loh wrote:
>>>>>> A colleague of mine ran some microkernels on an 8-way Barcelona box
>>>>>> (Sun x2200M2 at 2.3 GHz).  Here are some performance comparisons with Scali.
>>>>>>  The performance tests are modified versions of the HPCC pingpong tests.
>>>>>>  The OMPI version is the trunk with my "single-queue" fixes... otherwise,
>>>>>> OMPI latency at higher np would be noticeably worse.
>>>>>>           latency(ns)   bandwidth(MB/s)
>>>>>>         (8-byte msgs)   (2M-byte msgs)
>>>>>>         =============    =============
>>>>>>   np    Scali    OMPI    Scali    OMPI
>>>>>>    2      327     661     1458    1295
>>>>>>    4      369     670     1517    1287
>>>>>>    8      414     758     1535    1294
>>>>>> OMPI latency is nearly 2x slower than Scali's.  Presumably, "fastpath"
>>>>>> PML latency optimizations would help us a lot here.  Thankfully, our latency
>>>>>> is flat with np with the recent "single-queue" fixes... otherwise our
>>>>>> high-np latency story would be so much worse.  We're behind on bandwidth as
>>>>>> well, though not as pitifully so.
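For anyone wanting to experiment with the fragment size discussed above, the sm BTL parameters can be changed per run without rebuilding. A hypothetical invocation (the 64K value is just an example to try, and ./pingpong stands in for whatever benchmark binary you are running):

```shell
# Inspect the current sm BTL defaults, including max_send_size
ompi_info --param btl sm | grep send_size

# Re-run the benchmark with a larger fragment size (64 KB here)
mpirun -np 8 --mca btl_sm_max_send_size 65536 ./pingpong
```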

Tim Mattox, Ph.D. -
 tmattox_at_[hidden] || timattox_at_[hidden]
    I'm a bright...