Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU
From: vasilis (gkanis_at_[hidden])
Date: 2009-05-28 03:26:11

> This is a problem of numerical stability, and there is no solution for
> such a problem in MPI. Usually, preconditioning the input matrix
> improves the numerical stability.

It could be a numerical stability problem, but that would imply that I have an
ill-conditioned matrix, which is not the case here.

> If you read the MPI standard, there is a __short__ section about what
> guarantees the MPI collective communications provide. There is only
> one: if you run the same collective twice, on the same set of nodes
> with the same input data, you will get the same output. In fact the
> main problem is that MPI considers all default operations (MPI_OP) as
> being commutative and associative, which is usually the case in real
> world but not when floating point rounding is around. When you
> increase the number of nodes, the data will be spread in smaller
> pieces, which means more operations will have to be done in order to
> achieve the reduction, i.e. more rounding errors might occur and so on.
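
The rounding effect described in the quoted paragraph can be reproduced without MPI. The following is a minimal Python sketch, not taken from the thread; the values 0.1, 0.2 and 0.3 are illustrative assumptions chosen only because their sums round differently depending on the grouping.

```python
# Floating-point addition is not associative: the grouping changes
# where the intermediate rounding happens.
a, b, c = 0.1, 0.2, 0.3   # illustrative values, not from the thread

left = (a + b) + c    # a + b is rounded first, then c is added
right = a + (b + c)   # b + c is rounded first, then a is added

print(left == right)        # False
print(abs(left - right))    # tiny, on the order of 1e-16
```

A reduction that combines partial sums in a different grouping is mathematically the same sum, but it can differ in the last bits in exactly this way.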

You would have a point if I saw these small differences in both A and b.
I am solving the system Ax=b with the MUMPS library. The construction of the
matrix A and the column vector b is distributed among np CPUs. The matrix A is
the same whether I use 2 CPUs or np CPUs. The vector b, however, changes
slightly when I use more than 2 CPUs.

My data are not spread in smaller pieces! I am using the FEM to solve the
system of equations, and I use MPI to partition the domain. Therefore, the
data (i.e., the vector of unknowns) is the same on all the CPUs, and each CPU
constructs a portion of A and b. Then, on the host CPU, I add all
these pieces into A and b.
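
Whether the order of this host-side addition matters can be checked with a small model that does not need MPI. The sketch below is a simulation under stated assumptions, not the actual code from this thread: `accumulate`, `own` and `parts` are hypothetical names, and the contribution values are deliberately exaggerated so that the effect is visible instead of hiding in the last digits.

```python
from itertools import permutations

def accumulate(own, parts, order):
    """Add the workers' contributions to the host's own value in the given rank order."""
    total = own
    for rank in order:
        total += parts[rank]
    return total

# Hypothetical contributions to a single entry of b: the host's own piece
# plus one piece from each of three worker ranks (np = 4).
own = 0.5
parts = {1: 1e17, 2: 1.0, 3: -1e17}

# Every possible arrival order, as could happen when receiving from any source.
arrival_results = {accumulate(own, parts, order) for order in permutations(parts)}

# Always adding in ascending rank order, repeated several times.
fixed_results = {accumulate(own, parts, sorted(parts)) for _ in range(10)}

print(arrival_results)   # more than one distinct value
print(fixed_results)     # a single value
```

With only one worker (np=2) there is just one possible order, so the result cannot vary. With more workers the same pieces can produce different sums depending on the order in which they are added, while a fixed order gives a reproducible (though not necessarily more accurate) result.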

Thank you,

> Thanks,
> george.
> On May 27, 2009, at 11:16, vasilis wrote:
> >> Rank 0 accumulates all the res_cpu values into a single array, res. It
> >> starts with its own res_cpu and then adds all other processes. When
> >> np=2, that means the order is prescribed. When np>2, the order is no
> >> longer prescribed and some floating-point rounding variations can
> >> start to occur.
> >
> > Yes you are right. Now, the question is why would these floating-point
> > rounding variations occur for np>2? It cannot be due to a not prescribed
> > order!!
> >
> >> If you want results to be more deterministic, you need to fix the order
> >> in which res is aggregated. E.g., instead of using MPI_ANY_SOURCE, loop
> >> over the peer processes in a specific order.
> >>
> >> P.S. It seems to me that you could use MPI collective operations to
> >> implement what you're doing. E.g., something like:
> >
> > I could use these operations for the res variable (will it make the
> > summation any faster?). But I cannot use them for the other 3 variables.