Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] SM btl slows down bandwidth?
From: Tim Mattox (timattox_at_[hidden])
Date: 2008-08-15 12:08:18


Hi Terry (and others),
I have previously explored this a bit on Linux/x86-64 and concluded that
Open MPI needs to supply its own memcpy routine to get good sm performance,
since the memcpy supplied by glibc is not even close to optimal.
We have an unused MCA framework already set up to supply an opal_memcpy.
AFAIK, George and Brian did the original work to set up that framework.
It has been on my to-do list for a while to start implementing opal_memcpy
components for the architectures I have access to, and to modify OMPI to
actually use opal_memcpy where it makes sense. Terry, I presume what you
suggest could be dealt with similarly when we are running/building on SPARC.

Any followup discussion on this should probably happen on the
developer mailing list.
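
For illustration, here is a minimal sketch (not actual Open MPI or opal_memcpy
code) of what a non-temporal-store copy looks like on x86 with SSE2
intrinsics; the function name and the alignment/size assumptions are mine:

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Hypothetical sketch: copy 'len' bytes using non-temporal (streaming)
 * stores, which bypass the cache on the way to memory.
 * Assumes: dst is 16-byte aligned, len is a multiple of 16. */
static void memcpy_nt(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t i, n = len / 16;

    for (i = 0; i < n; i++) {
        __m128i v = _mm_loadu_si128(s + i);  /* 16-byte unaligned load */
        _mm_stream_si128(d + i, v);          /* cache-bypassing store */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```

As Ron notes below, this kind of copy mainly wins for large transfers whose
destination is not read back right away; for small messages that the receiver
touches immediately, the cache bypass can hurt.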

On Thu, Aug 14, 2008 at 12:19 PM, Terry Dontje <Terry.Dontje_at_[hidden]> wrote:
> Interestingly enough, on the SPARC platform the Solaris memcpy implementations
> actually use non-temporal stores for copies >= 64KB. By default, some of the
> MCA parameters to the sm BTL stop at 32KB. I've experimented with bumping the
> sm segment sizes above 64K and seen incredible speedups on our M9000
> platforms. I am looking for a clean way to integrate into Open MPI a memcpy
> that lowers this non-temporal-store threshold to 32KB or below.
> I have not looked into whether the Solaris x86/x64 memcpy implementations use
> non-temporal stores or not.
>
> --td
>>
>> Message: 1
>> Date: Thu, 14 Aug 2008 09:28:59 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] SM btl slows down bandwidth?
>> To: rbbrigh_at_[hidden], Open MPI Users <users_at_[hidden]>
>> Message-ID: <562557EB-857C-4CA8-97AD-F294C7FEDC77_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>>
>> At this time, we are not using non-temporal stores for shared memory
>> operations.
>>
>>
>> On Aug 13, 2008, at 11:46 AM, Ron Brightwell wrote:
>>
>>>> [...]
>>>>
>>>> MPICH2 manages to get about 5GB/s in shared memory performance on the
>>>> Xeon 5420 system.
>>>
>>> Does the sm btl use a memcpy with non-temporal stores like MPICH2?
>>> This can be a big win for bandwidth benchmarks that don't actually
>>> touch their receive buffers at all...
>>>
>>> -Ron
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox_at_[hidden] || timattox_at_[hidden]
 I'm a bright... http://www.the-brights.net/