How did you configure Open MPI, and is your application using SIGUSR1?

This error message indicates that Open MPI's daemons could not
communicate with the application processes. The daemons send SIGUSR1
to each process to initiate the handshake (you can change this signal
with -mca opal_cr_signal). If your application does not respond to the
daemon within a time bound (20 seconds by default, though you can
change it with -mca snapc_full_max_wait_time), then this error is
printed and the checkpoint is aborted.
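
If you are not sure whether your application (or a library it links
against) has claimed SIGUSR1, one way to check is to query the signal's
current disposition without changing it. The following is only a
minimal sketch: signal_disposition() is a hypothetical helper of my
own, not part of Open MPI, and it relies on nothing beyond POSIX
sigaction(). I am assuming that Open MPI installs its own SIGUSR1
handler when checkpoint support is enabled, so a "custom handler"
result after MPI_Init may simply be Open MPI's; the more useful
comparison is before and after your own setup code runs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <signal.h>
#include <stddef.h>

/* Hypothetical helper (not part of Open MPI): report the current
 * disposition of a signal without changing it.
 * Returns 0 = default action, 1 = ignored, 2 = custom handler,
 * -1 = sigaction() failed (for example, an invalid signal number). */
int signal_disposition(int signo)
{
    struct sigaction current;

    /* Passing NULL as the new action only queries the existing one. */
    if (sigaction(signo, NULL, &current) != 0)
        return -1;

    /* SA_SIGINFO handlers are stored in sa_sigaction, so treat them
     * as custom handlers without inspecting sa_handler. */
    if (current.sa_flags & SA_SIGINFO)
        return 2;
    if (current.sa_handler == SIG_DFL)
        return 0;
    if (current.sa_handler == SIG_IGN)
        return 1;
    return 2;
}
```

Calling signal_disposition(SIGUSR1) at a few points in your program
(for instance right after your own initialization) and printing the
result should show whether something in the application has replaced
or ignored the signal. If it has, either free up SIGUSR1 or pick a
different signal with -mca opal_cr_signal.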
On Sep 22, 2009, at 1:43 AM, Mallikarjuna Shastry wrote: