On 05/23/2012 03:05 PM, Jeff Squyres wrote:
> On May 23, 2012, at 6:05 AM, Simone Pellegrini wrote:
>>> If process A sends a message to process B and the eager protocol is used then I assume that the message is written into a shared memory area and picked up by the receiver when the receive operation is posted.
> Open MPI has a few different shared memory protocols.
> For short messages, they always follow what you mention above: CICO.
> For large messages, we either use a pipelined CICO (as you surmised below) or use direct memory mapping if you have the Linux knem kernel module installed. More below.
>>> When the rendezvous protocol is used, however, the message still needs to end up in the shared memory area somehow. I don't think any RDMA-like transfer exists for shared memory communications.
> Just to clarify: RDMA = Remote Direct Memory Access, and the "remote" usually refers to a different physical address space (e.g., a different server).
> In Open MPI's case, knem can use a direct memory copy between two processes.
>>> Therefore you need to buffer this message somehow, however I assume that you don't buffer the whole thing but use some type of pipelined protocol so that you reduce the size of the buffer you need to keep in the shared memory.
> Correct. For large messages, when using CICO, we copy the first fragment and the necessary meta data to the shmem block. When the receiver ACKs the first fragment, we pipeline CICO the rest of the large message through the shmem block. With the sender and receiver (more or less) simultaneously writing and reading to the circular shmem block, we probably won't fill it up -- meaning that the sender hypothetically won't need to block.
> I'm skipping a bunch of details, but that's the general idea.
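To check my understanding of the pipelined CICO description above, here is a toy single-process simulation. It is only a sketch, not Open MPI's actual implementation: the function name `pipelined_cico`, the fragment size, and the slot count are all made up for illustration, and the first-fragment ACK handshake is left out. The sender and receiver take strict turns, which stands in for the "more or less simultaneous" writing and reading.

```python
def pipelined_cico(message, frag_size=4, slots=2):
    """Move `message` through a small ring buffer, one fragment at a time.

    Returns (received_bytes, peak_slots_in_use). All names and sizes here
    are hypothetical; this only illustrates the copy-in/copy-out pipeline.
    """
    if frag_size <= 0 or slots <= 0:
        raise ValueError("frag_size and slots must be positive")

    ring = [None] * slots      # stand-in for the circular shmem block
    head = 0                   # next slot the sender writes
    tail = 0                   # next slot the receiver reads
    used = 0                   # slots currently holding a fragment
    sent = 0                   # bytes the sender has copied in so far
    received = bytearray()     # receiver's private destination buffer
    peak = 0                   # highest ring occupancy observed

    while len(received) < len(message):
        # Sender: copy one fragment IN if a slot is free, otherwise wait.
        if sent < len(message) and used < slots:
            frag = bytes(message[sent:sent + frag_size])
            ring[head] = frag
            sent += len(frag)
            head = (head + 1) % slots
            used += 1
            peak = max(peak, used)

        # Receiver: copy one fragment OUT if any is pending.
        if used > 0:
            received += ring[tail]
            ring[tail] = None
            tail = (tail + 1) % slots
            used -= 1

    return bytes(received), peak


if __name__ == "__main__":
    data = bytes(range(50))
    out, peak = pipelined_cico(data, frag_size=8, slots=2)
    print(out == data, peak)
```

In this sketch the ring never holds more than one fragment at a time because the receiver drains as fast as the sender fills, which matches the remark that the sender hypothetically won't need to block. A slower receiver would raise the occupancy until the sender has to wait for a free slot.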
>>> Is it completely wrong? It would be nice if someone could point me somewhere I can find more details about this. In the OpenMPI tuning page there are several details regarding the protocol utilized for IB but very little for SM.
> Good point. I'll see if we can get some more info up there.
>> I think I found the answer to my question on Jeff Squyres blog:
>> However now I have a new question, how do I know if my machine uses the copyin/copyout mechanism or the direct mapping?
> You need the Linux knem module. See the OMPI README and do a text search for "knem".
Thanks a lot for the clarification.
However, I still have a hard time explaining the following phenomenon.
I have a very simple code performing a ping/pong between 2 processes
which are allocated on the same computing node. Each process is bound to
a different CPU via affinity settings.
I perform this operation with 3 cache scenarios:
1) Cache is completely invalidated before the send/recv (both at the
sender and receiver side)
2) Cache is preloaded before the send/recv operation and it's in
3) Cache is preloaded before the send/recv operation but this time cache
lines are in a "modified" state
Now scenario 2 has a speedup over scenario 1, as expected. However,
scenario 3 is much slower than 1. I observed this for both knem and xpmem.
I assume something is forcing the modified cache lines to be written back
to memory before the copy is performed. Perhaps this is because the segment
is assigned to a volatile pointer, so the cached data somehow has to be
written into main memory.
In contrast, when the Open MPI CICO protocol is used, scenarios 2 and 3
have exactly the same speedup over 1. Therefore I assume that in this case
nothing forces the write-back of dirty cache lines. I have been puzzling
over this issue since yesterday, and it's quite difficult to understand
without knowing all the internal details.
Is this expected behaviour for you as well, or do you find it surprising? :)