vasilis wrote:
Rank 0 accumulates all the res_cpu values into a single array, res. It
starts with its own res_cpu and then adds all other processes. When
np=2, that means the order is prescribed. When np>2, the order is no
longer prescribed and some floating-point rounding variations can start
to occur.
Yes, you are right. Now, the question is: why would these floating-point rounding
variations occur for np>2? It cannot be due to an unprescribed order!!
The accumulation of res_cpu into res starts with rank 0 and then
handles everyone else in arbitrary order (due to MPI_ANY_SOURCE). With
np=2, this means the order is fully deterministic (0 then 1). With
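np>2, the order is no longer deterministic. E.g., for np=3, you
could have 0 then 1 then 2, or you could have 0 then 2 then 1.
Because floating-point addition is not associative, those two orders
can round differently and give slightly different sums.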
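
To make the effect of summation order concrete, here is a minimal sketch without any MPI. The three values r0, r1 and r2 are hypothetical stand-ins for one entry of res_cpu on ranks 0, 1 and 2; they are chosen only to exaggerate the rounding and are not taken from the actual program.

```fortran
program sum_order
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Hypothetical per-rank contributions (assumption: not from the real code).
  real(real64) :: r0, r1, r2
  real(real64) :: sum_012, sum_021

  r0 =  1.0e16_real64
  r1 = -1.0e16_real64
  r2 =  1.0_real64

  ! Rank 0 first, then 1, then 2. Parentheses fix the evaluation order.
  sum_012 = (r0 + r1) + r2
  ! Rank 0 first, then 2, then 1.
  sum_021 = (r0 + r2) + r1

  print *, 'order 0,1,2: ', sum_012
  print *, 'order 0,2,1: ', sum_021
  print *, 'identical?   ', sum_012 == sum_021
end program sum_order
```

With ordinary IEEE 754 double precision the two sums should differ, because r0 + r2 cannot be represented exactly and the 1.0 is lost before r1 is added. In a real run the differences are usually far smaller, typically in the last few digits.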
Here is another version of the code, without MPI_ANY_SOURCE or
MPI_ANY_TAG:
if ( mumps_par%MYID .eq. 0 ) then
   do jw = 0, nsize-1
      if ( jw /= 0 ) then
         call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                       jw, 5, MPI_COMM_WORLD, status1, ierr)
         call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                       jw, 6, MPI_COMM_WORLD, status2, ierr)
         call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                       jw, 7, MPI_COMM_WORLD, status3, ierr)
         call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                       jw, 8, MPI_COMM_WORLD, status4, ierr)
      end if
      res(:)             = res(:)             + res_cpu(:)
      jacob(:,jw)        = jacob(:,jw)        + jacob_cpu(:)
      position_col(:,jw) = position_col(:,jw) + col_cpu(:)
      position_row(:,jw) = position_row(:,jw) + row_cpu(:)
   end do
else
   call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                 0, 5, MPI_COMM_WORLD, ierr)
   call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                 0, 6, MPI_COMM_WORLD, ierr)
   call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                 0, 7, MPI_COMM_WORLD, ierr)
   call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                 0, 8, MPI_COMM_WORLD, ierr)
end if
P.S. It seems to me that you could use MPI collective operations to
implement what you're doing. E.g., something like:
I could use these operations for the res variable (will it make the summation
any faster?).
Potentially faster. It allows the underlying MPI implementation to
introduce optimizations (also potentially leading to the nondeterminism
as you have observed!). The other reason to use collective operations,
however, is to make your code more readable.
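
As a rough illustration of what a collective version of the res summation could look like, here is a small self-contained sketch using MPI_Reduce with MPI_SUM. The array size and the values put into res_cpu are placeholders I made up for the example; only the names res, res_cpu and total_unknowns come from the code above.

```fortran
program reduce_res
  use mpi
  implicit none
  ! Hypothetical size; the real value comes from the application.
  integer, parameter :: total_unknowns = 4
  double precision :: res_cpu(total_unknowns), res(total_unknowns)
  integer :: ierr, myid

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)

  ! Placeholder local contribution: rank r contributes r+1 in every entry.
  res_cpu = dble(myid + 1)
  res = 0.0d0

  ! Element-wise sum of res_cpu over all ranks; the result lands in res on rank 0.
  call MPI_Reduce(res_cpu, res, total_unknowns, MPI_DOUBLE_PRECISION, &
                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)

  if (myid == 0) print *, 'res = ', res

  call MPI_Finalize(ierr)
end program reduce_res
```

Two things to keep in mind: MPI_Reduce overwrites res on the root instead of adding to whatever was already there, and the MPI implementation is free to choose the order in which it combines the contributions, so the last digits may still differ from those of a hand-written loop.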
But I cannot use them for the other 3 variables.
You can use an MPI_Gather operation to gather the data to rank 0 and
then perform the summation on-node. You need to decide (based on
performance, readability, etc.) if you want to make that change.
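
A sketch of the gather idea for jacob, assuming every rank sends the same number of elements (as the receive counts in the posted code suggest). The constant n is a hypothetical stand-in for total_elem_cpu*unique, and the values in jacob_cpu are placeholders.

```fortran
program gather_jacob
  use mpi
  implicit none
  ! Hypothetical stand-in for total_elem_cpu*unique.
  integer, parameter :: n = 3
  double precision :: jacob_cpu(n)
  double precision, allocatable :: jacob(:,:)
  integer :: ierr, myid, nsize

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nsize, ierr)

  ! Placeholder local data: every entry holds the rank number.
  jacob_cpu = dble(myid)

  ! The receive buffer only matters on the root, but it is allocated
  ! on every rank here to keep the sketch short.
  allocate(jacob(n, 0:nsize-1))
  jacob = 0.0d0

  ! Fortran arrays are column-major, so the block from rank jw
  ! ends up in column jw of jacob on rank 0.
  call MPI_Gather(jacob_cpu, n, MPI_DOUBLE_PRECISION, &
                  jacob,     n, MPI_DOUBLE_PRECISION, &
                  0, MPI_COMM_WORLD, ierr)

  if (myid == 0) print *, 'column of last rank: ', jacob(:, nsize-1)

  deallocate(jacob)
  call MPI_Finalize(ierr)
end program gather_jacob
```

The same pattern would apply to row_cpu and col_cpu with MPI_INTEGER. Since the gathered columns are stored in rank order, any summation done afterwards on rank 0 runs in a fixed order and is reproducible from run to run.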