The accumulation of res_cpu into res starts with rank 0 and then
handles everyone else in arbitrary order (due to MPI_ANY_SOURCE). With
np=2, this means the order is fully deterministic (0 then 1). With
np>2, the order is no longer deterministic. E.g., for np=3, you
could have 0 then 1 then 2, or you could have 0 then 2 then 1.
Rank 0 accumulates all the res_cpu values into a single array, res. It
starts with its own res_cpu and then adds all other processes. When
np=2, that means the order is prescribed. When np>2, the order is no
longer prescribed and some floating-point rounding variations can start
to occur.
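The underlying reason is that floating-point addition is not
associative: summing the same values in a different order can round
differently. A minimal single-precision illustration (a sketch, not
code from this thread):

   program fp_order
      implicit none
      real :: a, b, c
      a = 1.0e8
      b = -1.0e8
      c = 1.0e-4
      ! The same three values, summed in two different orders:
      print *, '(a+b)+c =', (a + b) + c   ! ~1.0E-04: the large terms cancel first
      print *, 'a+(b+c) =', a + (b + c)   ! 0.0: c is absorbed by b's magnitude
   end program fp_order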
Yes, you are right. Now the question is: why would these floating-point
rounding variations occur for np>2? It cannot be due to an unprescribed order!
Here is another version of the code, without MPI_ANY_SOURCE or MPI_ANY_TAG:
   if ( mumps_par%MYID .eq. 0 ) then
      do jw = 0, nsize-1
         if ( jw /= 0 ) then
            ! Receive rank jw's contribution in a fixed order.  The trailing
            ! arguments (datatype, tag, status) were truncated in the post and
            ! are reconstructed here; the matching receives for jacob_cpu,
            ! row_cpu and col_cpu are elided as in the original.
            call MPI_Recv( res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                           jw, tag, MPI_COMM_WORLD, status, ierr )
         end if
         res(:)             = res(:)             + res_cpu(:)
         jacob(:,jw)        = jacob(:,jw)        + jacob_cpu(:)
         position_col(:,jw) = position_col(:,jw) + col_cpu(:)
         position_row(:,jw) = position_row(:,jw) + row_cpu(:)
      end do
   else
      ! Non-root ranks send their local arrays to rank 0 (trailing
      ! arguments reconstructed: destination 0, tag, communicator).
      call MPI_Send( res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, tag, MPI_COMM_WORLD, ierr )
      call MPI_Send( row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, ierr )
      call MPI_Send( col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, ierr )
   end if
Potentially faster: it allows the underlying MPI implementation to
introduce optimizations (which can also produce the nondeterminism you
observed!). The other reason to use collective operations is to make
your code more readable.
P.S. It seems to me that you could use MPI collective operations to
implement what you're doing. E.g., something like:
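For the res accumulation, a minimal sketch (assuming res and res_cpu
are double precision, which the post does not state):

   ! Sum every rank's res_cpu element-wise into res on rank 0
   call MPI_Reduce( res_cpu, res, total_unknowns, MPI_DOUBLE_PRECISION, &
                    MPI_SUM, 0, MPI_COMM_WORLD, ierr )

Note that MPI_Reduce may apply MPI_SUM in an implementation-defined
order, so it does not by itself guarantee bitwise-reproducible sums.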
I could use these operations for the res variable (will it make the
summation faster?).
You can use an MPI_Gather operation to collect the data on rank 0 and
then perform the summation there. You need to decide (based on
performance, readability, etc.) whether you want to make that change.
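A minimal sketch of that pattern, assuming res_cpu is double precision
and introducing a hypothetical rank-0 buffer all_res(total_unknowns, nsize):

   ! Gather every rank's res_cpu into the columns of all_res on rank 0
   ! (column jw holds rank jw-1's contribution)
   call MPI_Gather( res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                    all_res, total_unknowns, MPI_DOUBLE_PRECISION, &
                    0, MPI_COMM_WORLD, ierr )
   if ( mumps_par%MYID .eq. 0 ) then
      res = 0.0d0
      do jw = 1, nsize        ! fixed summation order
         res(:) = res(:) + all_res(:,jw)
      end do
   end if

Because the on-node loop always adds the columns in the same order, the
rounding is reproducible for any np.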
But I cannot use them for the other three variables.