Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] "An error occurred in MPI_Recv" with more than 2 CPU
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-26 12:21:32

vasilis wrote:

>Dear Open MPI users,
>I am trying to develop a code that runs in parallel with Open MPI (version
>1.3.2). The code is written in Fortran 90, and I am running it on a cluster.
>If I use 2 CPUs the program runs fine, but with a larger number of CPUs I get
>the following error:
>[compute-2-6.local:18491] *** An error occurred in MPI_Recv
>[compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
>[compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
>[compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>Here is the part of the code that this error refers to:
>if( mumps_par%MYID .eq. 0 ) THEN
> res=res+res_cpu
> do iw=1,total_elem_cpu*unique
> jacob(iw)=jacob(iw)+jacob_cpu(iw)
> position_col(iw)=position_col(iw)+col_cpu(iw)
> position_row(iw)=position_row(iw)+row_cpu(iw)
> end do
> do jw=1,nsize-1
> call MPI_Recv(res_cpu, total_unknowns, ..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status2, ierr)
> call MPI_Recv(jacob_cpu, total_elem_cpu*unique, ..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status1, ierr)
> call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status3, ierr)
> call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status4, ierr)
> res=res+res_cpu
> do iw=1,total_elem_cpu*unique
> jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw)=&
> jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw)+jacob_cpu(iw)
> position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw)=&
> position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw)+col_cpu(iw)
> position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw)=&
> position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw)+row_cpu(iw)
> end do
> end do
> else
> call MPI_Isend(res_cpu, ..., request1, ierr)
> call MPI_Isend(jacob_cpu, ..., request2, ierr)
> call MPI_Isend(row_cpu, ..., request3, ierr)
> call MPI_Isend(col_cpu, ..., request4, ierr)
> call MPI_Wait(request1, status1, ierr)
> call MPI_Wait(request2, status2, ierr)
> call MPI_Wait(request3, status3, ierr)
> call MPI_Wait(request4, status4, ierr)
> end if
>I am also using the MUMPS library.
>Could someone help me track this error down? It is really annoying to be
>able to use only two processors.
>The cluster has about 8 nodes, each with 4 dual-core CPUs. I tried to run
>the code on a single node with more than 2 CPUs, but I got the same error!
I think the error message means that a received message was longer than
the receive buffer posted for it. (There is a small reproducer of just
that failure after the code below.) If I look at your code and try to
reason about its correctness, I think of the message-passing portion as
looking like this:

if( mumps_par%MYID .eq. 0 ) THEN
    do jw=1,nsize-1
        call MPI_Recv( res_cpu,   total_unknowns,        ..., MPI_ANY_SOURCE, MPI_ANY_TAG, ... )
        call MPI_Recv( jacob_cpu, total_elem_cpu*unique, ..., MPI_ANY_SOURCE, MPI_ANY_TAG, ... )
        call MPI_Recv( row_cpu,   total_elem_cpu*unique, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, ... )
        call MPI_Recv( col_cpu,   total_elem_cpu*unique, MPI_INTEGER, MPI_ANY_SOURCE, MPI_ANY_TAG, ... )
    end do
else
    call MPI_Send( res_cpu,   total_unknowns,        ..., 0, ... )
    call MPI_Send( jacob_cpu, total_elem_cpu*unique, ..., 0, ... )
    call MPI_Send( row_cpu,   total_elem_cpu*unique, MPI_INTEGER, 0, ... )
    call MPI_Send( col_cpu,   total_elem_cpu*unique, MPI_INTEGER, 0, ... )
end if
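
As an aside, that reading of MPI_ERR_TRUNCATE is easy to confirm with a
toy program (the names below are made up for illustration): rank 1 sends
50 doubles, rank 0 posts a 10-double receive, and Open MPI aborts with
the same message you saw.

program truncate_demo
    use mpi
    implicit none
    integer :: ierr, myid
    integer :: status(MPI_STATUS_SIZE)
    double precision :: big(50), small(10)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
    if (myid == 0) then
        ! The receive buffer holds 10 doubles, but a 50-double message
        ! arrives, so MPI_Recv fails with MPI_ERR_TRUNCATE.
        call MPI_Recv(small, 10, MPI_DOUBLE_PRECISION, 1, 0, &
                      MPI_COMM_WORLD, status, ierr)
    else if (myid == 1) then
        big = 1.0d0
        call MPI_Send(big, 50, MPI_DOUBLE_PRECISION, 0, 0, &
                      MPI_COMM_WORLD, ierr)
    end if
    call MPI_Finalize(ierr)
end program truncate_demo

Run it with "mpirun -np 2" and you get the MPI_ERRORS_ARE_FATAL abort,
since that is the default error handler on MPI_COMM_WORLD.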

If you're running on two processes, the messages arrive in the order you
expect. With more than two processes, however, messages will start
appearing "out of order", and your indiscriminate use of MPI_ANY_SOURCE
and MPI_ANY_TAG will mix them up. You won't simply get all the messages
from one rank, then all from another, and so on. Rather, the messages
from the different senders arrive interleaved, while you interpret them
in a fixed order.

Here is what I mean. Say you have 3 processes. Rank 0 will receive 8
messages: 4 from rank 1 and 4 from rank 2. Correspondingly, ranks 1 and 2
will each send 4 messages to rank 0. Here is one possible order in which
the messages are received:

jacob_cpu from rank 1
jacob_cpu from rank 2
res_cpu from rank 1
row_cpu from rank 1
res_cpu from rank 2
row_cpu from rank 2
col_cpu from rank 2
col_cpu from rank 1

Rank 0, however, tries to unpack these in the fixed order prescribed in
your code, so data gets misinterpreted. More to the point here, some of
the time you will be receiving a message into a buffer of the wrong size:
for example, a jacob_cpu message of length total_elem_cpu*unique can match
the receive you posted for res_cpu, whose length is total_unknowns. That
is exactly the truncation the error message is complaining about.

Maybe you should use tags to distinguish the different types of messages
you're sending, so that each receive can match only the message type it
expects.
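
Here is a minimal, self-contained sketch of that idea (the tag names and
buffer sizes are made up for illustration, not taken from your code):
each message type gets its own tag, and MPI_SOURCE from the first receive
is used to pull the companion buffer from the same worker, so the pieces
stay consistent with each other.

program tagged_recv_sketch
    use mpi
    implicit none
    ! Hypothetical sizes standing in for total_unknowns and
    ! total_elem_cpu*unique; the point is only that they differ.
    integer, parameter :: NRES = 10, NJAC = 50
    integer, parameter :: TAG_RES = 1, TAG_JACOB = 2
    integer :: ierr, myid, nprocs, jw, src
    integer :: status(MPI_STATUS_SIZE)
    double precision :: res_cpu(NRES), jacob_cpu(NJAC)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    if (myid == 0) then
        do jw = 1, nprocs-1
            ! MPI_ANY_SOURCE is still usable here: the tag pins the
            ! message type, and all messages of one type are the same
            ! length, so no receive buffer can be overrun.
            call MPI_Recv(res_cpu, NRES, MPI_DOUBLE_PRECISION, &
                          MPI_ANY_SOURCE, TAG_RES, MPI_COMM_WORLD, &
                          status, ierr)
            src = status(MPI_SOURCE)   ! which worker this came from
            ! Fetch the companion buffer from that same worker.
            call MPI_Recv(jacob_cpu, NJAC, MPI_DOUBLE_PRECISION, &
                          src, TAG_JACOB, MPI_COMM_WORLD, status, ierr)
        end do
    else
        res_cpu   = myid
        jacob_cpu = myid
        call MPI_Send(res_cpu, NRES, MPI_DOUBLE_PRECISION, 0, &
                      TAG_RES, MPI_COMM_WORLD, ierr)
        call MPI_Send(jacob_cpu, NJAC, MPI_DOUBLE_PRECISION, 0, &
                      TAG_JACOB, MPI_COMM_WORLD, ierr)
    end if

    call MPI_Finalize(ierr)
end program tagged_recv_sketch

The same pattern extends to row_cpu and col_cpu with two more tags. Using
the source of the first receive for the remaining ones also guarantees
that the pieces you accumulate in one loop iteration all came from the
same worker.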