In my code I use mpi_send/mpi_recv to exchange a three-dimensional real array. The process is as follows:
all other processes send their arrays to rank 0; rank 0 receives them and assembles them into one complete array. Then mpi_bcast is called to send the complete array from rank 0 to all the others.
This pattern of communication reminds me of an MPI_Allgather (or the more flexible version MPI_Allgatherv).
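To make that suggestion concrete, here is a minimal sketch of how the gather-to-root-then-broadcast sequence might collapse into one mpi_allgatherv call. The array sizes (nx, ny, nz), the fill values, and the decomposition along the third dimension are all assumptions for illustration, since the original code was not posted. Splitting along the last dimension keeps each rank's slab contiguous in Fortran's column-major layout, so plain counts and displacements are enough.

```fortran
program allgatherv_sketch
  use mpi
  implicit none
  ! Hypothetical global sizes; the real ones are not given in the thread.
  integer, parameter :: nx = 4, ny = 3, nz = 10
  integer :: ierr, rank, nprocs, p, nz_local
  integer, allocatable :: counts(:), displs(:)
  real, allocatable :: local(:,:,:), full(:,:,:)

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, rank, ierr)
  call mpi_comm_size(mpi_comm_world, nprocs, ierr)

  ! Number of elements contributed by each rank: nz is split as evenly
  ! as possible, and each plane holds nx*ny elements.
  allocate(counts(nprocs), displs(nprocs))
  do p = 0, nprocs - 1
     counts(p+1) = nz / nprocs
     if (p < mod(nz, nprocs)) counts(p+1) = counts(p+1) + 1
     counts(p+1) = counts(p+1) * nx * ny
  end do

  ! Offset (in elements) of each rank's slab inside the complete array.
  displs(1) = 0
  do p = 2, nprocs
     displs(p) = displs(p-1) + counts(p-1)
  end do

  nz_local = counts(rank+1) / (nx * ny)
  allocate(local(nx, ny, nz_local), full(nx, ny, nz))
  local = real(rank)   ! placeholder data: each rank fills its slab with its rank

  ! One collective replaces the send/recv loop and the broadcast:
  ! every rank ends up with the complete array in full.
  call mpi_allgatherv(local, counts(rank+1), mpi_real, &
                      full, counts, displs, mpi_real, &
                      mpi_comm_world, ierr)

  if (rank == 0) print *, 'owner of first and last plane:', full(1,1,1), full(1,1,nz)

  call mpi_finalize(ierr)
end program allgatherv_sketch
```

If the real decomposition is along the first or second dimension, the slabs are no longer contiguous and a derived datatype (or a pack/unpack step) would be needed instead.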
This is a very basic use of mpi_send and mpi_recv. In my Fortran code I found that if I add call mpi_barrier(...) before the mpi_send and mpi_recv statements, the wall time for the sending and receiving (2s) is much lower than when mpi_barrier is not called (60s). I used mpi_wtime to measure the time.
In a parallel application the processes are generally out of sync with one another. I have no idea how you measured the time in the original version, but I guess that in the MPI_Barrier case you start your timer after the barrier. Since the barrier puts all processes in sync, you only measure the real time needed to exchange the data, which might seem shorter. Without the barrier, the timer also includes the time spent waiting for the slowest process to reach the exchange.
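One way to check this guess is to time the waiting and the exchange separately. The sketch below is only an illustration under assumptions: the placement of the timers and the variable names are mine, and the comments mark where the original computation and exchange would go.

```fortran
program timing_sketch
  use mpi
  implicit none
  integer :: ierr, rank
  double precision :: t0, t1, t2, t_wait, t_comm, max_wait, max_comm

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, rank, ierr)

  ! ... computation that may finish at different times on each rank ...

  t0 = mpi_wtime()
  call mpi_barrier(mpi_comm_world, ierr)   ! absorbs the skew between ranks
  t1 = mpi_wtime()

  ! ... the mpi_send/mpi_recv + mpi_bcast exchange goes here ...

  t2 = mpi_wtime()

  t_wait = t1 - t0   ! time lost waiting for the slowest rank
  t_comm = t2 - t1   ! time for the exchange once everyone is in sync

  ! Report the worst case over all ranks on rank 0.
  call mpi_reduce(t_wait, max_wait, 1, mpi_double_precision, mpi_max, 0, &
                  mpi_comm_world, ierr)
  call mpi_reduce(t_comm, max_comm, 1, mpi_double_precision, mpi_max, 0, &
                  mpi_comm_world, ierr)

  if (rank == 0) then
     print *, 'max wait at barrier (s):', max_wait
     print *, 'max exchange time   (s):', max_comm
  end if

  call mpi_finalize(ierr)
end program timing_sketch
```

If the guess is right, the sum of the two numbers should be close to what the version without the barrier reports, with most of it showing up as wait time.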
I think mpi_send and mpi_recv are blocking subroutines, so no additional mpi_barrier should be needed. Can anybody tell me the reason for this phenomenon? Thank you very much.
Yes, these operations are indeed blocking, which is why you see the slowdown. If a single process is late to send its contribution, the entire operation is penalized (as the root, i.e. process zero, waits for the contributions in order). So you should either try the collective pattern I highlighted before, switch from blocking to non-blocking point-to-point, or look into the potential benefit of a non-blocking collective.
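As a rough illustration of the last option, here is a sketch using the MPI-3 non-blocking collective mpi_iallgather. The sizes, the equal slab per rank, and the decomposition along the third dimension are assumptions made to keep the example short; with unequal slabs, mpi_iallgatherv would be the counterpart.

```fortran
program iallgather_sketch
  use mpi
  implicit none
  ! Hypothetical sizes: every rank owns nz_local planes of nx*ny elements.
  integer, parameter :: nx = 4, ny = 3, nz_local = 2
  integer :: ierr, rank, nprocs, request
  integer :: status(mpi_status_size)
  real, allocatable :: local(:,:,:), full(:,:,:)

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, rank, ierr)
  call mpi_comm_size(mpi_comm_world, nprocs, ierr)

  allocate(local(nx, ny, nz_local), full(nx, ny, nz_local * nprocs))
  local = real(rank)   ! placeholder data

  ! Start the gather; the call returns without waiting for other ranks.
  call mpi_iallgather(local, nx * ny * nz_local, mpi_real, &
                      full,  nx * ny * nz_local, mpi_real, &
                      mpi_comm_world, request, ierr)

  ! ... independent work that touches neither local nor full ...

  ! Block only when the complete array is actually needed.
  call mpi_wait(request, status, ierr)

  if (rank == 0) print *, 'owner of last plane:', full(1, 1, nz_local * nprocs)

  call mpi_finalize(ierr)
end program iallgather_sketch
```

This only helps if there is useful work to overlap with the communication; a late process still delays the completion of mpi_wait on everyone else.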
users mailing list