I am measuring some timings for MPI_Send/MPI_Recv. I am doing a single
communication between 2 processes and I repeat this several times to get
meaningful values. The message size varies from 64 bytes up to 16 MB,
doubling each time (64, 128, 256, ..., 8M, 16M).
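To make the sweep concrete, here is a minimal sketch of how the sizes can be enumerated. It is Python for illustration only, not the benchmark itself; the name `message_sizes` and its default arguments are made up for this example and simply encode the range given above.

```python
def message_sizes(smallest=64, largest=16 * 1024 * 1024):
    """Return the message sizes in bytes, doubling from smallest to largest."""
    sizes = []
    size = smallest
    while size <= largest:
        sizes.append(size)
        size *= 2  # each step doubles the previous message size
    return sizes


if __name__ == "__main__":
    sizes = message_sizes()
    # Print how many sizes are in the sweep, plus the first and last one.
    print(len(sizes), sizes[0], sizes[-1])
```

With the defaults this gives 19 sizes, from 64 bytes to 16777216 bytes.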
Here is some context on how I am running this experiment.
The experiment is executed on a multicore architecture, with the 2
processes bound to 2 distinct cores of the CPU. Both processes run on the
same node.
The underlying CPU is an AMD Istanbul CPU (6 cores) with a 64 KB L1 data
cache, a 64 KB L1 instruction cache, a 512 KB L2 cache, and a 6 MB shared
L3 cache. The node
contains 2 sockets, so each CPU gets exactly one of the 2 MPI processes.
I am using OpenMPI version 1.4.4 (compiled by myself using the default
configuration; I did not use any fancy SM implementation).
In order to force the SM module I run my code using the following MCA
parameter: "--mca btl sm,self"
I am also aware of the *eager_limit* and the various thresholds present
in the OpenMPI library. To avoid confusion I set these two parameters to
16MB (more than twice the size of the L3 cache):
*btl_sm_eager_limit* and *btl_sm_max_send_size*
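Put together, the launch line looks roughly like the one assembled by the sketch below. This is only an illustration: the program name `./send_recv_bench` is a placeholder, the helper `build_mpirun_command` is made up for this example, and the core-binding options are left out because they are not part of the MCA settings discussed here.

```python
SIXTEEN_MB = 16 * 1024 * 1024  # 16777216 bytes


def build_mpirun_command(program="./send_recv_bench", limit_bytes=SIXTEEN_MB):
    """Assemble the mpirun argument list for the 2-process shared-memory run.

    The program name is a hypothetical placeholder for the benchmark binary.
    """
    return [
        "mpirun", "-np", "2",
        "--mca", "btl", "sm,self",                          # force the SM BTL
        "--mca", "btl_sm_eager_limit", str(limit_bytes),    # eager threshold
        "--mca", "btl_sm_max_send_size", str(limit_bytes),  # max fragment size
        program,
    ]


if __name__ == "__main__":
    # Print the command as a single line that could be pasted into a shell.
    print(" ".join(build_mpirun_command()))
```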
Besides the time, I am measuring a couple of HW counters using PAPI. In
particular I am interested in total instructions (PAPI_TOT_INS) and
branch instructions (PAPI_BR_INS).
Enough with the context; this is what I am observing. At 16 MB there is
a clear increase in the number of instructions and branch instructions
(which can be explained by my settings of eager_limit and max_send_size).
However, something weird already happens at 32KB, where I clearly see an
increase in the number of branches and total instructions. There are
almost 0 branch instructions up to 32KB, and from 32KB to 16MB the
increase is linear. At 16MB there is another jump, followed again by a
linear increase.
It seems that there is another threshold driving this behavior. I tried
setting other parameters for the SM BTL (btl_sm_fifo_size and
btl_sm_exclusivity), but nothing changed. From my understanding of MPI,
this should be a kind of pipelining of the message, which is
transferred in chunks (probably of 32KB).
How can I override this behavior? Is there any parameter I can set?
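To show why chunking would produce this shape, here is a small model of the hypothesis. It is not taken from the OpenMPI sources: the 32KB chunk size is only my guess from the measurements, and `fragment_count` is a name invented for this sketch. If each chunk costs a roughly fixed number of branches, the count stays flat up to one chunk and then grows linearly with the message size.

```python
def fragment_count(message_bytes, chunk_bytes=32 * 1024):
    """Number of chunks needed if a message is split into chunk_bytes pieces.

    chunk_bytes defaults to 32KB, which is an assumption, not a known value.
    """
    if message_bytes <= 0:
        return 0
    # Ceiling division: a partially filled last chunk still counts as one.
    return (message_bytes + chunk_bytes - 1) // chunk_bytes


if __name__ == "__main__":
    for size in (64, 16 * 1024, 32 * 1024, 64 * 1024, 1024 * 1024):
        print(size, fragment_count(size))
```

In this model every message up to 32KB needs a single chunk, 64KB needs 2, and 1MB needs 32, so in a doubling sweep the first visible change would be the step from 32KB to 64KB.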
I also noticed that while this happens for MPI_Send, the MPI_Recv
operation behaves differently. For the receive routine there is no bump
in branch and total instructions; the increase is linear starting from
64 bytes. The increase in branch instructions does slow down, however,
after the 16MB threshold. My guess is that the receive is busy-waiting
for the message, so the number of branches grows in proportion to the
time it takes for the message to arrive.
This is my hypothesis but you probably know better.
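The busy-waiting idea can be illustrated with a toy polling loop. This is only a sketch of the hypothesis, not how MPI_Recv is actually implemented: `spin_wait` and `ready_after` are invented names, and the "sender" is a stand-in that becomes ready after a fixed number of polls.

```python
def spin_wait(is_ready):
    """Poll is_ready() until it returns True; return how many polls failed."""
    failed_polls = 0
    while not is_ready():  # one loop branch is taken for every failed poll
        failed_polls += 1
    return failed_polls


def ready_after(n_polls):
    """Hypothetical stand-in for the sender: not ready for the first n_polls checks."""
    remaining = [n_polls]

    def is_ready():
        if remaining[0] == 0:
            return True
        remaining[0] -= 1
        return False

    return is_ready


if __name__ == "__main__":
    # The longer the message takes to arrive, the more loop branches are counted.
    for delay in (10, 100, 1000):
        print(delay, spin_wait(ready_after(delay)))
```

The number of loop iterations, and so the number of branches, scales directly with how long the receiver has to wait.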
Graphs are attached. Thanks in advance for your help.
cheers, Simone P.