Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] memcpy MCA framework
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-08-18 09:42:37


We don't really need finer-grained knowledge about the processor at
compile time. The only thing we should detect is whether a bit of code
can or cannot be compiled. We can deal with the processor
characteristics at runtime. I imagine that most of today's processors
can export an ID string, with bits set for the supported instruction
sets (at least x86 does). Based on these bits [at runtime] we can
figure out whether a special version of memcpy can be used or not.
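
Something along these lines (a minimal, untested sketch using GCC's
cpuid.h; the bit positions are the documented CPUID feature flags) is
all the runtime detection would take:

  #include <cpuid.h>
  #include <stdio.h>

  int main(void)
  {
      unsigned int eax, ebx, ecx, edx;

      /* leaf 1 reports the feature flags */
      if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
          printf("CPUID not available\n");
          return 1;
      }
      printf("SSE:  %s\n", (edx & (1u << 25)) ? "yes" : "no");
      printf("SSE2: %s\n", (edx & (1u << 26)) ? "yes" : "no");
      printf("SSE3: %s\n", (ecx & (1u <<  0)) ? "yes" : "no");
      return 0;
  }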

The second question is how and when to figure out which of the
available memcpy functions gives the best performance. On a homogeneous
architecture, this might be a one-node selection [I don't imagine
using the modex to spread this information], whereas on a heterogeneous
one every class of processors should do it. The really annoying thing
here is that in the best case [in a perfect world] this should be
done once per cluster. There is no need to run the benchmark at each
startup. We should think about a storage mechanism where nodes can
push small bits of information that will be available on subsequent
runs. A little bit like the registry, but more stable...
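
As a strawman for the select-then-cache idea (hypothetical names and
plain libc timing; the real thing would live behind the MCA framework):

  #include <stdio.h>
  #include <string.h>
  #include <time.h>

  typedef void *(*memcpy_fn_t)(void *, const void *, size_t);

  /* the candidates are whatever variants survived the compile-time
   * checks; plain memcpy is always a safe entry */
  static memcpy_fn_t candidates[] = { memcpy /*, memcpy_sse2, ... */ };

  static double time_one(memcpy_fn_t fn, void *dst, const void *src,
                         size_t len)
  {
      clock_t start = clock();
      for (int i = 0; i < 1000; i++) {
          fn(dst, src, len);
      }
      return (double)(clock() - start) / CLOCKS_PER_SEC;
  }

  /* benchmark every candidate once and cache the winner on disk so
   * subsequent startups can skip the benchmark entirely */
  int opal_memcpy_select(void *dst, const void *src, size_t len)
  {
      int n = sizeof(candidates) / sizeof(candidates[0]);
      int best = 0;
      double best_time = time_one(candidates[0], dst, src, len);

      for (int i = 1; i < n; i++) {
          double t = time_one(candidates[i], dst, src, len);
          if (t < best_time) {
              best_time = t;
              best = i;
          }
      }
      FILE *f = fopen("ompi_memcpy_choice", "w");  /* the "registry" */
      if (f) {
          fprintf(f, "%d\n", best);
          fclose(f);
      }
      return best;
  }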

   george.

On Aug 18, 2008, at 3:16 AM, Brian Barrett wrote:

> I obviously won't be in Dublin (I'll be in a fishing boat in the
> middle of nowhere Canada -- much better), so I'm going to chime in
> now.
>
> The m4 part actually isn't too bad and is pretty simple. Other than
> looking at some variables set by ompi_config_asm, I'm not sure there
> is much to check. The hard parts are dealing with the finer-grained
> instruction set requirements.
>
> On x86 in particular, many of the operations in the memcpy are part
> of SSE, SSE2, or SSE3. Currently, we don't have any finer concept
> of a processor than x86, and most compilers target an instruction set
> that will run on anything considered a 686, which is almost everything
> out there. We'd have to decide how to handle instruction streams that
> are no longer going to work on every chip. Since we know we have a
> number of users with heterogeneous x86 clusters, this is something to
> think about.
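>
> A guard-plus-fallback pattern is probably what this turns into (a
> sketch with hypothetical names, not code from the tree):
>
>   #include <string.h>
>
>   /* built only when the toolchain can emit SSE2 at all */
>   #if defined(__SSE2__)
>   void *memcpy_sse2(void *dst, const void *src, size_t len);
>   #endif
>
>   extern int cpu_has_sse2;  /* hypothetical: set from CPUID at init */
>
>   void *opal_memcpy(void *dst, const void *src, size_t len)
>   {
>   #if defined(__SSE2__)
>       if (cpu_has_sse2) {              /* runtime check, per chip */
>           return memcpy_sse2(dst, src, len);
>       }
>   #endif
>       return memcpy(dst, src, len);    /* safe on any 686 */
>   }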
>
> Brian
>
> On Aug 17, 2008, at 7:57 AM, Jeff Squyres wrote:
>
>> Let's talk about this in Dublin. I can probably help with the m4
>> magic, but I need to understand exactly what needs to be done first.
>>
>>
>> On Aug 16, 2008, at 11:51 AM, Terry Dontje wrote:
>>
>>> George Bosilca wrote:
>>>> The intent of the memcpy framework is to allow a selection
>>>> between several memcpy implementations at runtime. Of course,
>>>> there will be a preselection at compile time, but all versions
>>>> that can compile on a given architecture will be benchmarked at
>>>> runtime and the best one will be selected. There is a file with
>>>> several versions of memcpy for x86 (32 and 64 bit) somewhere
>>>> around (I should have a copy if anyone is interested) that can be
>>>> used as a starting point.
>>>>
>>> Ok, I guess I need to look at this code. I wonder if there may be
>>> cases for Sun's machines in which this benchmark could end up
>>> picking the wrong memcpy?
>>>> The only thing we need is a volunteer to build the m4 magic.
>>>> Figuring out what we can compile is kind of tricky, as some of
>>>> the functions are in assembly, some others in C, and some others
>>>> a mixture (the MMX headers).
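>>>>
>>>> In practice the configure check just needs to see whether a probe
>>>> like this compiles (a representative C probe, not actual m4):
>>>>
>>>>   #include <emmintrin.h>   /* SSE2 intrinsics */
>>>>
>>>>   int main(void)
>>>>   {
>>>>       __m128i zero = _mm_setzero_si128();
>>>>       _mm_storeu_si128(&zero, zero);
>>>>       return 0;
>>>>   }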
>>>>
>>> Isn't the atomic code very similar? If I get to this point before
>>> anyone else I probably will volunteer.
>>>
>>> --td
>>>> george.
>>>>
>>>> On Aug 16, 2008, at 3:19 PM, Terry Dontje wrote:
>>>>
>>>>> Hi Tim,
>>>>> Thanks for bringing the below up and asking for a redirection to
>>>>> the devel list. I think looking at/using the MCA memcpy framework
>>>>> would be a good thing to do, and maybe we can work on this
>>>>> together once I get out from under some commitments. However, one
>>>>> of the challenges that originally scared me away from looking at
>>>>> the memcpy MCA is whether we really want all the OMPI memcpys to
>>>>> be replaced or just specific ones. Also, I was concerned about
>>>>> trying to figure out which version of memcpy I should be using. I
>>>>> believe things are currently done such that you get one version
>>>>> based on which system you compile on. For Sun there may be
>>>>> several different SPARC platforms that would need to use
>>>>> different memcpy code, but we would like to ship just one set of
>>>>> bits.
>>>>> I'm not saying the above is not doable under the memcpy MCA
>>>>> framework, just that it somewhat scared me away from thinking
>>>>> about it at first glance.
>>>>>
>>>>> --td
>>>>>> Date: Fri, 15 Aug 2008 12:08:18 -0400
>>>>>> From: "Tim Mattox" <timattox_at_[hidden]>
>>>>>> Subject: Re: [OMPI users] SM btl slows down bandwidth?
>>>>>> To: "Open MPI Users" <users_at_[hidden]>
>>>>>>
>>>>>> Hi Terry (and others),
>>>>>> I have previously explored this some on Linux/x86-64 and
>>>>>> concluded that Open MPI needs to supply its own memcpy routine
>>>>>> to get good sm performance, since the memcpy supplied by glibc
>>>>>> is not even close to optimal. We have an unused MCA framework
>>>>>> already set up to supply an opal_memcpy. AFAIK, George and Brian
>>>>>> did the original work to set up that framework. It has been on
>>>>>> my to-do list for a while to start implementing opal_memcpy
>>>>>> components for the architectures I have access to, and to modify
>>>>>> OMPI to actually use opal_memcpy where it makes sense. Terry, I
>>>>>> presume what you suggest could be dealt with similarly when we
>>>>>> are running/building on SPARC. Any followup discussion on this
>>>>>> should probably happen on the developer mailing list.
>>>>>>
>>>>>> On Thu, Aug 14, 2008 at 12:19 PM, Terry Dontje
>>>>>> <Terry.Dontje_at_[hidden]> wrote:
>>>>>>> Interestingly enough, on the SPARC platform the Solaris
>>>>>>> memcpys actually use non-temporal stores for copies >= 64KB.
>>>>>>> By default, some of the mca parameters to the sm BTL stop at
>>>>>>> 32KB. I've experimented with bumping the sm segment sizes to
>>>>>>> above 64K and seen incredible speedup on our M9000 platforms.
>>>>>>> I am looking for some nice way to integrate a memcpy that
>>>>>>> lowers this boundary to 32KB or lower into Open MPI.
>>>>>>> I have not looked into whether the Solaris x86/x64 memcpys use
>>>>>>> non-temporal stores or not.
>>>>>>>
>>>>>>> --td
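>>>>>>>
>>>>>>> For reference, a non-temporal copy like the one described above
>>>>>>> boils down to something like this (an illustrative SSE2 sketch
>>>>>>> assuming 16-byte-aligned buffers, not the Solaris
>>>>>>> implementation), with the threshold lowered to 32KB:
>>>>>>>
>>>>>>>   #include <emmintrin.h>
>>>>>>>   #include <string.h>
>>>>>>>
>>>>>>>   void *memcpy_nt(void *dst, const void *src, size_t len)
>>>>>>>   {
>>>>>>>       /* small copies keep using the cache-friendly path */
>>>>>>>       if (len < 32 * 1024) return memcpy(dst, src, len);
>>>>>>>
>>>>>>>       __m128i *d = (__m128i *)dst;
>>>>>>>       const __m128i *s = (const __m128i *)src;
>>>>>>>       for (size_t i = 0; i < len / 16; i++) {
>>>>>>>           /* streaming stores bypass the cache on the way out */
>>>>>>>           _mm_stream_si128(d + i, _mm_load_si128(s + i));
>>>>>>>       }
>>>>>>>       _mm_sfence();   /* make the streaming stores visible */
>>>>>>>       memcpy((char *)dst + (len & ~(size_t)15),
>>>>>>>              (const char *)src + (len & ~(size_t)15),
>>>>>>>              len & 15);   /* copy the tail bytes normally */
>>>>>>>       return dst;
>>>>>>>   }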
>>>>>>>
>>>>>>>> Message: 1
>>>>>>>> Date: Thu, 14 Aug 2008 09:28:59 -0400
>>>>>>>> From: Jeff Squyres <jsquyres_at_[hidden]>
>>>>>>>> Subject: Re: [OMPI users] SM btl slows down bandwidth?
>>>>>>>> To: rbbrigh_at_[hidden], Open MPI Users <users_at_[hidden]>
>>>>>>>>
>>>>>>>> At this time, we are not using non-temporal stores for shared
>>>>>>>> memory operations.
>>>>>>>>
>>>>>>>> On Aug 13, 2008, at 11:46 AM, Ron Brightwell wrote:
>>>>>>>>
>>>>>>>>>> [...]
>>>>>>>>>>
>>>>>>>>>> MPICH2 manages to get about 5GB/s in shared memory
>>>>>>>>>> performance on the Xeon 5420 system.
>>>>>>>>>
>>>>>>>>> Does the sm btl use a memcpy with non-temporal stores like
>>>>>>>>> MPICH2? This can be a big win for bandwidth benchmarks that
>>>>>>>>> don't actually touch their receive buffers at all...
>>>>>>>>>
>>>>>>>>> -Ron
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> Cisco Systems
>>>>>>
>>>>>> --
>>>>>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>>>>>> tmattox_at_[hidden] || timattox_at_[hidden]
>>>>>> I'm a bright... http://www.the-brights.net/
>>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>


