I think most of the issues with the numbers you're getting come from the internal protocols of Open MPI and from the way compilers "optimize" the memcpy function. In fact, memcpy translates to different execution paths depending on the size of the data. For large memory copies, MMX or SSE-family instructions are used. For smaller copies, some compilers use the movsb instruction to implement memcpy. This leads to a significantly smaller number of branches in the PAPI reading, because the movsb __always__ counts as a single branch.
In a similar context we ended up hijacking the memcpy function so that we could count its branches, misses, and instructions and then subtract them from the numbers seen by the upper level. This gives a more consistent view of the number of branches, because the compiler's choice of memcpy variant is kept out of your counts.
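To make the idea concrete, here is a minimal sketch in C of that kind of accounting. It is not the code we actually used. The names counted_memcpy, counting_set_source, events_excluding_memcpy and fake_counter are hypothetical, and the counter is read through a plain callback rather than through real PAPI calls. The sketch also assumes the wrapper is called explicitly in place of memcpy, whereas a real setup would have to interpose on memcpy itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counter source. In a real setup this callback would wrap a
 * hardware counter read (for example branches or instructions obtained
 * through PAPI); here it is only an assumption used for illustration. */
typedef long long (*counter_fn)(void);

static counter_fn read_counter = NULL; /* installed by the caller */
static long long memcpy_events = 0;    /* events attributed to memcpy so far */

/* Install the counter source and reset the accumulated memcpy share. */
void counting_set_source(counter_fn fn)
{
    read_counter = fn;
    memcpy_events = 0;
}

/* Wrapper called in place of memcpy: reads the counter before and after
 * the copy and accumulates the difference. */
void *counted_memcpy(void *dst, const void *src, size_t n)
{
    assert(read_counter != NULL);
    long long before = read_counter();
    void *ret = memcpy(dst, src, n);
    memcpy_events += read_counter() - before;
    return ret;
}

/* What the upper level sees once the memcpy share has been removed. */
long long events_excluding_memcpy(long long total_events)
{
    return total_events - memcpy_events;
}

/* Stand-in counter for demonstration only: advances by 10 on every read,
 * so each counted_memcpy call is charged exactly 10 events. */
static long long fake_ticks = 0;
long long fake_counter(void)
{
    fake_ticks += 10;
    return fake_ticks;
}
```

With fake_counter installed through counting_set_source, one counted_memcpy call is charged 10 events, so events_excluding_memcpy(100) returns 90. With a real counter, the remaining number no longer depends on which memcpy variant the compiler picked.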
On Apr 19, 2012, at 06:53, Simone Pellegrini wrote:
> Enough with the context, this is what I am observing. At 16 MB there is a clear increase in the number of instructions and branch instructions (and this can be explained by my settings of eager_limit and max send size).
> However, something weird already happens at 32 KB, where I clearly see an increase in the number of branches and total instructions. There are almost 0 branch instructions up to 32 KB, and from 32 KB to 16 MB there is a linear increase. At 16 MB there is another jump and then again a linear increase.
> It seems that there is another threshold driving this behavior. I tried setting other parameters for the SM BTL (btl_sm_fifo_size, btl_sm_exclusivity), but nothing changed. From my understanding of MPI, this should be a kind of pipelining of the message, which is being transferred in chunks (probably of 32 KB).
> How can I override this behavior? Is there any parameter I can set?
> I also noticed that while this is happening for MPI_Send, the MPI_Recv operation behaves differently. For the receive routine there is no bump in branch and total instructions. The increase is linear starting from 64 bytes. The increase in branch instructions slows down, however, after the 16 MB threshold. My idea is that the receive is probably busy-waiting for the message, and therefore the number of branches grows in proportion to the time it takes for the message to arrive.
> This is my hypothesis but you probably know better.
> Graphs are attached. Thanks in advance for your help.