
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU
From: vasilis (gkanis_at_[hidden])
Date: 2009-05-27 06:09:27


Thank you, Eugene, for your suggestion. I used a different tag for each
variable, and now I do not get this error.
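For reference, the change amounts to giving each buffer type its own tag (a
minimal sketch only, not the actual code; the TAG_* constants are illustrative
names, and the buffer and size variables are the ones from the code below):

```fortran
! One distinct tag per message type, so the receiver can never match a
! message to the wrong buffer, even when ranks' messages arrive interleaved.
integer, parameter :: TAG_JACOB = 1, TAG_RES = 2, TAG_ROW = 3, TAG_COL = 4

! Sender side: the tag now identifies the message type, not mumps_par%MYID.
call MPI_Isend(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
               TAG_JACOB, MPI_COMM_WORLD, request1, ierr)
call MPI_Isend(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
               TAG_RES, MPI_COMM_WORLD, request2, ierr)

! Receiver side: MPI_ANY_SOURCE is still fine, but the specific tag
! guarantees the buffer length matches the incoming message, and
! status1(MPI_SOURCE) still identifies the sending rank.
call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
              MPI_ANY_SOURCE, TAG_JACOB, MPI_COMM_WORLD, status1, ierr)
call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
              MPI_ANY_SOURCE, TAG_RES, MPI_COMM_WORLD, status2, ierr)
```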
The problem now is that I get a different solution when I use more than
2 CPUs. I checked the matrices and found that they differ by a very small
amount, on the order of 10^(-10). In fact, I even get a different solution
with 4 CPUs than with 16 CPUs!
Do you have any idea what could cause this behavior?
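(One possibility I am considering: floating-point addition is not associative,
so summing the ranks' partial results in a different order with a different
number of CPUs can change the low-order bits of the result. A tiny
stand-alone demonstration, with made-up values unrelated to the actual code:)

```fortran
program sum_order
  implicit none
  double precision :: a, b, c, d
  ! 1.0d0 is below the rounding granularity of 1.0d16 in double precision,
  ! so it is lost when added to 1.0d16 before the cancellation happens.
  a = 1.0d16
  b = 1.0d0
  c = -1.0d16
  d = 1.0d0
  print *, ((a + b) + c) + d   ! left-to-right order: 1.0
  print *, (a + c) + (b + d)   ! regrouped, as another rank count might: 2.0
end program sum_order
```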

Thank you,
Vasilis

On Tuesday 26 of May 2009 7:21:32 pm you wrote:
> vasilis wrote:
> >Dear Open MPI users,
> >
> >I am trying to develop a code that runs in parallel with Open MPI
> > (version 1.3.2). The code is written in Fortran 90, and I am running
> > it on a cluster.
> >
> >If I use 2 CPUs the program runs fine, but for a larger number of CPUs
> > I get the following error:
> >
> >[compute-2-6.local:18491] *** An error occurred in MPI_Recv
> >[compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
> >[compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
> >[compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
> >abort)
> >
> >Here is the part of the code that this error refers to:
> > if( mumps_par%MYID .eq. 0 ) THEN
> >   res = res + res_cpu
> >   do iw = 1, total_elem_cpu*unique
> >     jacob(iw) = jacob(iw) + jacob_cpu(iw)
> >     position_col(iw) = position_col(iw) + col_cpu(iw)
> >     position_row(iw) = position_row(iw) + row_cpu(iw)
> >   end do
> >
> >   do jw = 1, nsize-1
> >     call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
> >                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
> >     call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
> >                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
> >     call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
> >                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
> >     call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
> >                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
> >
> >     res = res + res_cpu
> >     do iw = 1, total_elem_cpu*unique
> >       jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >         jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) + jacob_cpu(iw)
> >       position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >         position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) + col_cpu(iw)
> >       position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >         position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) + row_cpu(iw)
> >     end do
> >   end do
> > else
> >   call MPI_Isend(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
> >                  mumps_par%MYID, MPI_COMM_WORLD, request1, ierr)
> >   call MPI_Isend(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
> >                  mumps_par%MYID, MPI_COMM_WORLD, request2, ierr)
> >   call MPI_Isend(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
> >                  mumps_par%MYID, MPI_COMM_WORLD, request3, ierr)
> >   call MPI_Isend(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
> >                  mumps_par%MYID, MPI_COMM_WORLD, request4, ierr)
> >   call MPI_Wait(request1, status1, ierr)
> >   call MPI_Wait(request2, status2, ierr)
> >   call MPI_Wait(request3, status3, ierr)
> >   call MPI_Wait(request4, status4, ierr)
> > end if
> >
> >
> >I am also using the MUMPS library.
> >
> >Could someone help me track this error down? It is really annoying to
> > be able to use only two processors.
> >The cluster has about 8 nodes, each with 4 dual-core CPUs. I tried to
> > run the code on a single node with more than 2 CPUs, but I got the
> > same error!
>
> I think the error message means that the received message was longer
> than the receive buffer that was specified. If I look at your code and
> try to reason about its correctness, I think of the message-passing
> portion as looking like this:
>
> if( mumps_par%MYID .eq. 0 ) THEN
>   do jw = 1, nsize-1
>     call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
>                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
>     call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
>                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
>     call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
>     call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>                   MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
>   end do
> else
>   call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
>                 mumps_par%MYID, MPI_COMM_WORLD, ierr)
>   call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
>                 mumps_par%MYID, MPI_COMM_WORLD, ierr)
>   call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>                 mumps_par%MYID, MPI_COMM_WORLD, ierr)
>   call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>                 mumps_par%MYID, MPI_COMM_WORLD, ierr)
> end if
>
> If you're running on two processes, the messages you receive arrive in
> the order you expect. With more than two processes, however, messages
> will inevitably start appearing "out of order," and your indiscriminate
> use of MPI_ANY_SOURCE and MPI_ANY_TAG will get them mixed up. You won't
> simply get all messages from one rank, then all from another, and so on.
> Rather, the messages from all these other processes will arrive
> interwoven, while you interpret them in a fixed order.
>
> Here is what I mean. Let's say you have 3 processes. So, rank 0 will
> receive 8 messages: 4 from rank 1 and 4 from rank 2. Correspondingly,
> rank 1 and rank 2 will each send 4 messages to rank 0. Here is a
> possibility for the order in which messages are received:
>
> jacob_cpu from rank 1
> jacob_cpu from rank 2
> res_cpu from rank 1
> row_cpu from rank 1
> res_cpu from rank 2
> row_cpu from rank 2
> col_cpu from rank 2
> col_cpu from rank 1
>
> Rank 0, however, is trying to unpack these in the order you prescribed
> in your code. Data will get misinterpreted. More to the point here,
> you will be trying to receive data into buffers of the wrong size (some
> of the time).
>
> Maybe you should use tags to distinguish between the different types of
> messages you're trying to send.
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users