Numeric differences are to be expected in parallel applications. The basic reason is that on many architectures floating-point operations are performed using a higher internal precision than that of the arguments, and only the final result is rounded back to the lower output precision. When the same operation is performed in parallel, intermediate results are communicated at the lower precision, and thus the final result could differ. How much it differs depends on the stability of the algorithm: it could be a slight difference in the last one or two significant bits, or it could be a completely different result (e.g. when integrating chaotic dynamical systems).
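Here is a minimal sketch of the effect, with hypothetical values chosen so that rounding the intermediate result matters. Whether the two sums actually differ depends on the target: e.g. on 32-bit x87 builds (FLT_EVAL_METHOD == 2) the first expression may be kept in extended precision, while on SSE-based x86-64 builds both are evaluated in plain float:

#include <stdio.h>

int main(void)
{
    /* Hypothetical values: 1.0f is smaller than one ulp of 1.0e8f,
       so rounding the partial sum a + b to float loses it entirely. */
    float a = 1.0e8f, b = 1.0f, c = -1.0e8f;

    /* The compiler may evaluate this whole expression at higher
       internal precision and round only the final result. */
    float whole = a + b + c;

    /* Storing the partial sum to a float variable - which is what
       happens when a partial sum is shipped over the wire as
       MPI_FLOAT - forces rounding after the first addition. */
    volatile float part = a + b;
    float split = part + c;

    printf("whole = %g, split = %g\n", whole, split);
    return 0;
}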
In your particular case, with one process MPI_Reduce effectively becomes a no-op and the summation is done entirely in the preceding loop. With two processes the sum is broken into two partial sums, each of which may be computed at higher internal precision but is converted back to float before being communicated.
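For reference, a minimal sketch of that pattern (the data distribution and the summed series here are assumptions for illustration, not your actual code):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Hypothetical even distribution of N terms over the ranks. */
    const int N = 1 << 20;
    int chunk = N / size;

    /* The local accumulator may be kept at higher internal
       precision for the duration of this loop. */
    float local = 0.0f;
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
        local += 1.0f / (float)(i + 1);

    /* Each partial sum is rounded to float here before being sent;
       with one process the reduction is a no-op and no such
       intermediate rounding takes place. */
    float global = 0.0f;
    MPI_Reduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.9g with %d process(es)\n", global, size);

    MPI_Finalize();
    return 0;
}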
You could try to “cure” this (non-problem) by telling your compiler not to use higher precision for intermediate results, e.g. with -ffloat-store or -fexcess-precision=standard for GCC, or -fp-model precise for the Intel compiler.
Hope that helps,
Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)