
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU
From: vasilis (gkanis_at_[hidden])
Date: 2009-05-28 03:27:26


On Wednesday 27 of May 2009 7:47:06 pm Damien Hocking wrote:
> I've seen this behaviour with MUMPS on shared-memory machines as well
> using MPI. I use the iterative refinement capability to sharpen the
> last few digits of the solution (2 or 3 iterations is usually enough).
> If you're not using that, give it a try; it will probably reduce the
> noise you're getting in your results. The quality of the answer from a
> direct solve is highly dependent on the matrix scaling and pivot order
> and it's easy to get differences in the last few digits. MUMPS itself
> is also asynchronous, and might not be completely deterministic in how
> it solves if MPI processes can run in a different order.
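The iterative refinement Damien describes is enabled through MUMPS's ICNTL control array. A hedged fragment for the C interface (it assumes a double-precision DMUMPS_STRUC_C handle named id has already been initialized elsewhere; check the MUMPS users' guide for the exact entry):

```c
/* ICNTL(10) = maximum number of iterative refinement steps performed
   after the direct solve.  The C struct's icntl[] array is 0-based,
   so ICNTL(10) maps to id.icntl[9].  (The handle name `id` is
   illustrative, not from the original code.) */
id.icntl[9] = 3;  /* 2 or 3 iterations are usually enough, per Damien */
```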

I will check that.

Thank you,
Vasilis

>
> Damien
>
> George Bosilca wrote:
> > This is a problem of numerical stability, and there is no solution for
> > such a problem in MPI. Usually, preconditioning the input matrix
> > improves the numerical stability.
> >
> > If you read the MPI standard, there is a __short__ section about what
> > guarantees the MPI collective communications provide. There is only
> > one: if you run the same collective twice, on the same set of nodes
> > with the same input data, you will get the same output. In fact the
> > main problem is that MPI considers all default operations (MPI_OP) to
> > be commutative and associative, which is usually true in the real
> > world but not once floating-point rounding is involved. When you
> > increase the number of nodes, the data will be spread into smaller
> > pieces, which means more operations have to be done in order to
> > achieve the reduction, i.e. more rounding errors might occur and so on.
> >
> > Thanks,
> > george.
> >
> > On May 27, 2009, at 11:16 , vasilis wrote:
> >>> Rank 0 accumulates all the res_cpu values into a single array, res. It
> >>> starts with its own res_cpu and then adds all other processes. When
> >>> np=2, that means the order is prescribed. When np>2, the order is no
> >>> longer prescribed and some floating-point rounding variations can start
> >>> to occur.
> >>
> >> Yes, you are right. Now, the question is why these floating-point
> >> rounding variations occur for np>2. It cannot be due to the
> >> unprescribed order alone!
> >>
> >>> If you want results to be more deterministic, you need to fix the order
> >>> in which res is aggregated. E.g., instead of using MPI_ANY_SOURCE,
> >>> loop
> >>> over the peer processes in a specific order.
> >>>
> >>> P.S. It seems to me that you could use MPI collective operations to
> >>> implement what you're doing. E.g., something like:
> >>
> >> I could use these operations for the res variable (will it make the
> >> summation any faster?). But I cannot use them for the other 3
> >> variables.
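Both suggestions quoted above can be sketched together. This is a hedged sketch only (the buffer names, vector length, and message tag are illustrative, not from the original code; it needs an MPI library and mpicc to build):

```c
#include <mpi.h>

#define N 1000  /* length of the result vector (illustrative) */

int main(int argc, char **argv) {
    double res[N], res_cpu[N], res2[N], tmp[N];
    int rank, np, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    for (i = 0; i < N; i++) res_cpu[i] = rank + 1.0;  /* dummy local data */

    /* Option 1: deterministic point-to-point aggregation.  Rank 0
       receives from peers in a FIXED order (source = 1, 2, ...) instead
       of MPI_ANY_SOURCE, so the summation order is reproducible across
       runs regardless of message arrival order. */
    if (rank == 0) {
        for (i = 0; i < N; i++) res[i] = res_cpu[i];
        for (int src = 1; src < np; src++) {
            MPI_Recv(tmp, N, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (i = 0; i < N; i++) res[i] += tmp[i];
        }
    } else {
        MPI_Send(res_cpu, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    /* Option 2: a collective.  Simpler and usually faster, but the
       internal reduction order is chosen by the MPI library; per the
       guarantee George cites, it is only promised to be identical when
       rerun on the same nodes with the same input data. */
    MPI_Reduce(res_cpu, res2, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

The fixed-order loop trades some overlap (rank 0 may wait on a slow peer while faster ones are ready) for bitwise reproducibility; MPI_Reduce makes the opposite trade.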
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users