> Date: Sat, 16 Aug 2008 08:18:47 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] SM btl slows down bandwidth?
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <1197BCE6-A7E3-499E-8B05-B85F7598D455_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> On Aug 15, 2008, at 3:32 PM, Gus Correa wrote:
>> > Just like Daniel and many others, I have seen some disappointing
>> > performance of MPI code on multicore machines,
>> > in code that scales fine in networked environments and single core
>> > CPUs,
>> > particularly in memory-intensive programs.
>> > The bad performance has been variously ascribed to memory
>> > bandwidth / contention,
>> > to setting processor and memory affinity versus letting the kernel
>> > scheduler do its thing,
>> > to poor performance of memcpy, and so on.
> I'd suspect that all of these play a role -- not necessarily any one
> single one of them.
> - It is my belief (contrary to several kernel developers' beliefs)
> that explicitly setting processor affinity is a Good Thing for MPI
> applications. Not only does MPI have more knowledge than the OS for a
> parallel job spanning multiple processes, each MPI process is
> allocating resources that may be spatially / temporally relevant. For
> example, say that an MPI process allocates some memory during MPI_INIT
> in a NUMA system. This memory will likely be "near" in a NUMA sense.
> If the OS later decides to move that process, then the memory would be
> "far" in a NUMA sense. Similarly, OMPI decides what I/O resources to
> use during MPI_INIT -- and may specifically choose some "near"
> resources (and exclude "far" resources). If the OS moves the process
> after MPI_INIT, these "near" and "far" determinations could become
> stale/incorrect, and performance would go down the tubes.
I've been in the discussion above for many years on the same side as
Jeff; however, I think that is due more to pragmatic reasoning than to
MPI being the right level for binding processes. The Solaris kernel
developers I've talked with believe the right way to do the above is for
MPI or the runtime to give the OS hints about the locality binding of
processes and have the OS try to maintain that locality. The reason is
that there might be other processes the OS is dealing with that MPI or
its runtime does not know about. Having MPI or its runtime force binding
really messes up an OS's ability to balance the workload on a system.
Mind you, on a machine with a small number of cores (fewer than 8) this
probably isn't as big an issue. But once you start dealing with large
SMPs with hundreds of cores, there is a good chance that more than one
MPI job is running on the machine. However, until MPI and OS
implementors come up with a way to pass such hints, it is a necessity
for MPI to do the binding, for the reasons Jeff gives above. Note that
Jeff, another member, and I have talked about such hints but have not
come up with anything definitive.
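
To make the trade-off concrete, here is a minimal sketch of runtime-driven
binding on Linux. It is not how Open MPI does it internally. The helper
names choose_cpu and bind_self are made up for illustration, and it
assumes the launcher exports the on-node rank in an environment variable
such as OMPI_COMM_WORLD_LOCAL_RANK (newer Open MPI releases set it; other
launchers may not). os.sched_setaffinity is Linux-only, so the sketch
degrades to a no-op elsewhere.

```python
import os


def choose_cpu(local_rank, allowed_cpus):
    """Map an on-node rank to one CPU id, round-robin over the allowed set."""
    cpus = sorted(allowed_cpus)
    if not cpus:
        raise ValueError("no CPUs available to bind to")
    return cpus[local_rank % len(cpus)]


def bind_self(local_rank):
    """Pin the calling process to one CPU.

    Returns the chosen CPU id, or None if the platform lacks the call
    or the kernel refuses the request.
    """
    if not hasattr(os, "sched_setaffinity"):  # Linux-only API
        return None
    try:
        cpu = choose_cpu(local_rank, os.sched_getaffinity(0))
        os.sched_setaffinity(0, {cpu})  # pid 0 means "this process"
    except OSError:
        return None
    return cpu


if __name__ == "__main__":
    # Assumption: the launcher provides the local rank; default to 0 otherwise.
    rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    print("bound to CPU:", bind_self(rank))
```

Note that choose_cpu only sees its own job. Two independent jobs that
apply the same rule on the same node both pick CPUs 0, 1, 2, ... and end
up stacked on the same cores, which is exactly the situation where an OS
acting on hints could do better than a forced binding.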
> - Unoptimized memcpy implementations are definitely a factor, mainly
> for large message transfers through shared memory. Since most (all?)
> MPI implementations use some form of shared memory for on-host
> communication, memcpy can play a big part in its performance for large
> messages. Using hardware (such as IB HCAs) for on-host communication
> can effectively avoid unoptimized memcpy's, but then you're just
> shifting the problem to the hardware -- you're now dependent upon the
> hardware's DMA engine (which is *usually* pretty good). But then
> other issues can arise, such as the asynchronicity of the transfer,
> potentially causing collisions and/or extra memory bus traversals that
> might be able to be avoided with memcpy (it depends on the topology
> inside your server -- e.g., if 2 processes are "far" from the IB HCA,
> then the transfer will have to traverse QPI/HT/whatever twice, whereas
> a memcpy would assumedly stay local). As Ron pointed out in this
> thread, non-temporal memcpy's can be quite helpful for benchmarks that
> don't touch the resulting message at the receiver (because the non-
> temporal memcpy doesn't bother to take the time to load the cache).
In addition to the above, you may run into platform-specific memory
architecture issues. For example, should the SM BTL lay out its FIFOs in
a specific way to get the best performance? The problem is that what is
great for one platform may suck eggs for another.
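
If you want a quick feel for how raw copy speed varies with message size
on a given box, a rough sketch like the following can help. It is only a
single-process stand-in, not a measurement of the SM BTL: it times a bulk
buffer copy (which CPython is expected to perform as one block copy
rather than byte by byte) and includes some interpreter overhead, so
treat small-size numbers with suspicion. The function name copy_bandwidth
and the chosen sizes are made up for illustration.

```python
import time


def copy_bandwidth(nbytes, repeats=5):
    """Return the best observed copy rate, in bytes/second, for nbytes."""
    src = memoryview(bytearray(nbytes))
    dst = memoryview(bytearray(nbytes))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer, no Python-level loop
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the timer on very small copies.
    return nbytes / max(best, 1e-9)


if __name__ == "__main__":
    # Sweep from 4 KiB to 16 MiB to cross the cache sizes of most machines.
    size = 4 * 1024
    while size <= 16 * 1024 * 1024:
        rate = copy_bandwidth(size)
        print(f"{size:>10d} bytes  {rate / 1e6:10.1f} MB/s")
        size *= 4
```

The rate usually drops once the buffers no longer fit in cache, and where
that happens differs from platform to platform, which is part of why one
layout or copy strategy does not suit every machine.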
> - Using different compilers is a highly religious topic, and IMHO, may
> tend to be application specific. Compilers are large complex software
> systems (just like MPI); different compiler authors have chosen to
> implement different optimizations that work well in different
> applications. So yes, you may well see different run-time performance
> with different compilers depending on your application and/or MPI
> implementations. Some compilers may have better memcpy's.
> My $0.02: I think there are a *lot* of factors involved here.
I agree, and we have probably just scratched the surface here.
Sun Microsystems, Inc.