Dear all,

I would like you to ask for a topic that there are already many questions but I am not familiar a lot with it. I want to understand the behaviour of an application where there are many messages less than 64KB (eager mode) and I use TCP network. I am trying to understand in order to simulate this application.
For example it can be possible to have one MPI_Send of 1200 bytes after some computation, then two messages of the same size, after computation, etc. However according to the measurements and the profiling the cost of the communication is less than the latency of the network. I can understand that the cost of the MPI_Send is the copy to the buffer however sending the message to the destination it should cost at least the latency. So are the messages buffered in the sender and they are sent as packet to the receiver? My tcp window is 4MB and I use the same value for snd_buff and rcv_buff. If they are buffered in the sender what is the criterion/algorithm? I mean if I have one message, after computation and after again message is it possible these two messages to be buffered from the sender point of view or this happens only on the receiver? If there is any document/paper that I can read about this I would be appreciate to provide me the link.
A simple example is that if I have a loop that rank 0 sends two messages to rank 1 then the duration of the first message is bigger than the second's one and if I increase the loop to 10 or 20 messages then all the messages cost a lot less than the first one and also less from what SkaMPI measures. So I am sure that it should be a buffer issue (or something else that I can't think about).

Best regards,