Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-26 12:21:32


vasilis wrote:

>Dear Open MPI users,
>
>I am trying to develop a code that runs in parallel with Open MPI (version
>1.3.2). The code is written in Fortran 90, and I am running it on a cluster.
>
>If I use 2 CPUs the program runs fine, but for a larger number of CPUs I get
>the following error:
>
>[compute-2-6.local:18491] *** An error occurred in MPI_Recv
>[compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
>[compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
>[compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now
>abort)
>
>Here is the part of the code that this error refers to:
>if( mumps_par%MYID .eq. 0 ) THEN
>    res = res + res_cpu
>    do iw = 1, total_elem_cpu*unique
>        jacob(iw) = jacob(iw) + jacob_cpu(iw)
>        position_col(iw) = position_col(iw) + col_cpu(iw)
>        position_row(iw) = position_row(iw) + row_cpu(iw)
>    end do
>
>    do jw = 1, nsize-1
>        call MPI_recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
>            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
>        call MPI_recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
>            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
>        call MPI_recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
>        call MPI_recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
>
>        res = res + res_cpu
>        do iw = 1, total_elem_cpu*unique
>            jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
>                jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) + jacob_cpu(iw)
>            position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
>                position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) + col_cpu(iw)
>            position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
>                position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) + row_cpu(iw)
>        end do
>    end do
>else
>    call MPI_Isend(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
>        mumps_par%MYID, MPI_COMM_WORLD, request1, ierr)
>    call MPI_Isend(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
>        mumps_par%MYID, MPI_COMM_WORLD, request2, ierr)
>    call MPI_Isend(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>        mumps_par%MYID, MPI_COMM_WORLD, request3, ierr)
>    call MPI_Isend(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>        mumps_par%MYID, MPI_COMM_WORLD, request4, ierr)
>    call MPI_Wait(request1, status1, ierr)
>    call MPI_Wait(request2, status2, ierr)
>    call MPI_Wait(request3, status3, ierr)
>    call MPI_Wait(request4, status4, ierr)
>end if
>
>
>I am also using the MUMPS library.
>
>Could someone help me track this error down? It is really annoying to be
>limited to only two processors.
>The cluster has about 8 nodes, each with 4 dual-core CPUs. I tried to run the
>code on a single node with more than 2 CPUs, but I got the same error!
>
>
I think the error message means that a received message was longer than
the receive buffer specified for it; that is what MPI_ERR_TRUNCATE
indicates. If I look at your code and try to reason about its
correctness, I think of the message-passing portion as looking like this:

if( mumps_par%MYID .eq. 0 ) THEN
    do jw = 1, nsize-1
        call MPI_recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
        call MPI_recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
        call MPI_recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
        call MPI_recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
            MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
    end do
else
    call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
        mumps_par%MYID, MPI_COMM_WORLD, ierr)
    call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
        mumps_par%MYID, MPI_COMM_WORLD, ierr)
    call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
        mumps_par%MYID, MPI_COMM_WORLD, ierr)
    call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
        mumps_par%MYID, MPI_COMM_WORLD, ierr)
end if

If you're running on two processes, the messages arrive in the order you
expect. With more than two processes, however, messages will start
appearing "out of order", and your indiscriminate use of MPI_ANY_SOURCE
and MPI_ANY_TAG will start mixing them up. You won't simply get all the
messages from one rank, then all from another, and so on. Rather, the
messages from the different processes will arrive interleaved, while you
interpret them in a fixed order.

Here is what I mean. Let's say you have 3 processes. Rank 0 will
receive 8 messages: 4 from rank 1 and 4 from rank 2. Correspondingly,
rank 1 and rank 2 will each send 4 messages to rank 0. Here is one
possible order in which the messages are received:

jacob_cpu from rank 1
jacob_cpu from rank 2
res_cpu from rank 1
row_cpu from rank 1
res_cpu from rank 2
row_cpu from rank 2
col_cpu from rank 2
col_cpu from rank 1

Rank 0, however, is trying to unpack these in the fixed order prescribed
in your code, so the data will get misinterpreted. More to the point
here, some of the time you will be trying to receive data into a buffer
of the wrong size, which is exactly what MPI_ERR_TRUNCATE reports.
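You can see the mechanism without MPI at all. Below is a toy sketch in
plain Python (not MPI code; the buffer sizes 6 and 4 are made up for
illustration) that replays the arrival order above and matches each
incoming message to the posted receives purely by position, which is
effectively what MPI_ANY_SOURCE plus MPI_ANY_TAG amount to here:

```python
# Toy model of rank 0's receive loop. Each worker rank sends four
# messages; rank 0 posts receives in a fixed order, so message i is
# matched to posted receive i -- regardless of what the message is.

# Hypothetical sizes: jacob/row/col buffers hold 6 elements, res holds 4.
RECV_SIZES = {"jacob": 6, "res": 4, "row": 6, "col": 6}
POSTED_ORDER = ["jacob", "res", "row", "col"]  # order of the MPI_Recv calls

# One possible interleaved arrival order (same as in the text above),
# as (message type, source rank) pairs.
arrivals = [
    ("jacob", 1), ("jacob", 2), ("res", 1), ("row", 1),
    ("res", 2), ("row", 2), ("col", 2), ("col", 1),
]

def receive_positionally(arrivals):
    """Match the i-th arriving message to the i-th posted receive,
    cycling through the four posted receives each round."""
    errors = []
    for i, (kind, src) in enumerate(arrivals):
        expected = POSTED_ORDER[i % len(POSTED_ORDER)]
        if RECV_SIZES[kind] > RECV_SIZES[expected]:
            # message longer than the buffer: the MPI_ERR_TRUNCATE case
            errors.append((kind, src, expected))
    return errors

truncated = receive_positionally(arrivals)
print(truncated)
```

With this arrival order, two 6-element messages land on the 4-element
res_cpu receive and get flagged, while the remaining mismatches fit in
their buffers and are silently misinterpreted instead.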

Maybe you should use tags to distinguish between the different types of
messages you're trying to send: for example, one tag for jacob_cpu,
another for res_cpu, and so on, with each MPI_recv asking for the
matching tag instead of MPI_ANY_TAG.
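To see why tags fix the size problem: MPI matches a receive only against
messages with the same tag, so a buffer can only ever get a message of
the type (and size) it expects. The same toy Python model as above
(again not MPI code; the tag numbers are made up) sketches that
matching rule:

```python
# Hypothetical tags, one per message type (instead of tag = sender's rank).
TAG = {"jacob": 1, "res": 2, "row": 3, "col": 4}
SIZES = {"jacob": 6, "res": 4, "row": 6, "col": 6}

# Same interleaved arrival order as before: (message type, source rank).
arrivals = [
    ("jacob", 1), ("jacob", 2), ("res", 1), ("row", 1),
    ("res", 2), ("row", 2), ("col", 2), ("col", 1),
]

def recv_by_tag(pending, tag):
    """Pop the earliest pending message whose tag matches -- a toy
    version of MPI_Recv with a specific tag and MPI_ANY_SOURCE."""
    for i, (kind, src) in enumerate(pending):
        if TAG[kind] == tag:
            return pending.pop(i)
    raise RuntimeError("no matching message")

pending = list(arrivals)
mismatches = 0
for _round in range(2):                # nsize-1 = 2 worker ranks
    for kind in ["jacob", "res", "row", "col"]:
        got, src = recv_by_tag(pending, TAG[kind])
        if SIZES[got] != SIZES[kind]:
            mismatches += 1
print(mismatches)  # prints 0
```

Every receive now drains a message of the type it expects, whatever the
arrival order. Note that your unpacking code should then also take the
source of each message from the status of that same receive, as you
already do with status1(MPI_SOURCE) and friends.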