On Sep 26, 2008, at 1:45 PM, Robert Kubrick wrote:
> I'm not sure how I should interpret this message:
> [local:17344] *** An error occurred in MPI_Testsome
> [local:17344] *** on communicator MPI COMMUNICATOR 5 CREATE FROM 0
> [local:17344] *** MPI_ERR_TRUNCATE: message truncated
> [local:17344] *** MPI_ERRORS_ARE_FATAL (goodbye)
> mpiexec noticed that job rank 0 with PID 17338 on node local exited
> on signal 15 (Terminated).
> 3 additional processes aborted (not shown)
> I am assuming that the error was triggered because one of the
> buffers I set in the MPI_Recv_init() calls cannot hold the
> incoming message.
Sorry for the delay in replying.
That is likely the cause -- MPI treats receiving a message larger than
the posted buffer as a run-time error (MPI_ERR_TRUNCATE).
> However, I don't understand why job rank 0 terminates first. The
> only process that contains a call to MPI_Testsome has actually rank
> 3, and it's receiving messages from rank 0.
The aborting process sends a message to kill all the other processes
in the job before it dies itself (i.e., to obey the semantics of an
MPI abort). So there is likely a race going on here: rank 0 gets
killed before rank 3 finishes dying, and mpirun reports rank 0 first.
> Also I think it would be a good idea to print the message tag in the
> error log.
Mm. Good point. I'll file this as a feature request -- we have
centralized error reporting for the abort sequence, so it'll take a
little noodling to get that in there. Probably won't happen for
v1.3[.0], but that's good real-world feedback to have. Thanks!