I am trying to implement the following collectives in MPI shared
memory: Alltoall, Broadcast, and Reduce, with zero-copy
optimizations. For Reduce, my compiler allocates all the send
buffers in shared memory (anonymous mmap) and allocates only one
receive buffer, again in shared memory. All the processes then reduce
into the root buffer in a data-parallel manner. It looks like Open MPI
does something similar, except that it must copy from/to the
send/receive buffers. So my implementation of Reduce should perform
better for large buffer sizes, but that is not the case. Does anybody
know why? Any pointers are welcome.
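
To make the scheme concrete, here is a minimal sketch of the data-parallel reduce described above. It is not my actual implementation: it uses plain fork() instead of MPI ranks, the function name shared_reduce_sum is made up for illustration, and it assumes a sum over doubles where "rank" r contributes the value r + 1 in every element. All send buffers and the single receive buffer live in anonymous shared mappings, and each process reduces its own contiguous slice of the receive buffer.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): sum-reduce nprocs send buffers
 * of n doubles each into one receive buffer, all in anonymous shared
 * memory. The result is copied into out. Returns 0 on success, -1 on error. */
int shared_reduce_sum(int nprocs, size_t n, double *out)
{
    size_t send_bytes = (size_t)nprocs * n * sizeof(double);
    size_t recv_bytes = n * sizeof(double);

    /* One region holding every send buffer, plus a single receive buffer. */
    double *send = mmap(NULL, send_bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (send == MAP_FAILED)
        return -1;
    double *recv = mmap(NULL, recv_bytes, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (recv == MAP_FAILED) {
        munmap(send, send_bytes);
        return -1;
    }

    /* Assumed test data: "rank" r contributes r + 1 in every element. */
    for (int r = 0; r < nprocs; r++)
        for (size_t i = 0; i < n; i++)
            send[(size_t)r * n + i] = (double)(r + 1);

    int rc = 0;
    int started = 0;
    for (int r = 0; r < nprocs; r++) {
        pid_t pid = fork();
        if (pid < 0) {
            rc = -1;
            break;
        }
        if (pid == 0) {
            /* Worker r owns elements [lo, hi) of the receive buffer and
             * reads that slice from every send buffer directly. */
            size_t lo = n * (size_t)r / (size_t)nprocs;
            size_t hi = n * (size_t)(r + 1) / (size_t)nprocs;
            for (size_t i = lo; i < hi; i++) {
                double acc = 0.0;
                for (int s = 0; s < nprocs; s++)
                    acc += send[(size_t)s * n + i];
                recv[i] = acc;
            }
            _exit(0);
        }
        started++;
    }

    /* Wait for every worker that was actually started. */
    for (int r = 0; r < started; r++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            rc = -1;
    }

    if (rc == 0)
        memcpy(out, recv, recv_bytes);
    munmap(send, send_bytes);
    munmap(recv, recv_bytes);
    return rc;
}
```

Calling shared_reduce_sum(4, 1000, out) with a caller-provided array of 1000 doubles should leave 1 + 2 + 3 + 4 in every element. Note that in this layout each process streams through all nprocs send buffers for its slice, so every process touches memory that was first written by another process.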
Also, the Open MPI Reduce performance shows large variations. I ran
Reduce with different array sizes with np = 8, 50 times each, and for
a single array size I find a significant number of outliers.
Has anybody faced similar problems?
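
For reference, this is roughly how the timings for one array size could be summarized to count outliers. It is only a sketch: the function name summarize_timings is made up, the quartiles are taken crudely as the sorted samples at indices n/4 and 3n/4, and "outlier" is assumed to mean a sample more than 1.5 times the interquartile range outside the quartiles.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Comparison function for qsort over doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper (illustration only): given n timings for one array
 * size, store the median in *median_out and return the number of samples
 * lying more than 1.5 * IQR outside the quartiles. Returns -1 on error. */
int summarize_timings(const double *t, size_t n, double *median_out)
{
    if (n == 0)
        return -1;

    /* Sort a copy so the caller's array stays in run order. */
    double *s = malloc(n * sizeof(double));
    if (s == NULL)
        return -1;
    memcpy(s, t, n * sizeof(double));
    qsort(s, n, sizeof(double), cmp_double);

    /* Median: middle sample, or mean of the two middle samples. */
    *median_out = (n % 2 == 1) ? s[n / 2]
                               : 0.5 * (s[n / 2 - 1] + s[n / 2]);

    /* Crude quartiles (assumption): sorted samples at n/4 and 3n/4. */
    double q1 = s[n / 4];
    double q3 = s[(3 * n) / 4];
    double iqr = q3 - q1;
    double lo = q1 - 1.5 * iqr;
    double hi = q3 + 1.5 * iqr;

    int outliers = 0;
    for (size_t i = 0; i < n; i++)
        if (s[i] < lo || s[i] > hi)
            outliers++;

    free(s);
    return outliers;
}
```

With the 50 timings for one array size in an array, the return value gives the outlier count and the median gives a number that is less sensitive to those outliers than the mean.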