
Subject: [OMPI users] more Bugs in MPI_Abort() -- mpirun
From: Randolph Pullen (randolph_pullen_at_[hidden])
Date: 2010-06-23 01:43:02


I have an MPI program that aggregates data from multiple SQL systems.  It all runs fine.  To test fault tolerance, I switch one of the machines off while the job is running.  The result is always a hang, i.e. mpirun never completes.
 
To try to avoid this I have replaced the blocking send and receive calls with immediate calls (i.e. MPI_Isend, MPI_Irecv) so that I can trap long-waiting sends and receives, but it makes no difference.
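Roughly, the trap looks like this (a minimal sketch, not my actual code -- the buffer, tag, 2,000,000-poll limit, and error code 5 are placeholder assumptions that just mirror the log below):

#include <mpi.h>
#include <stdio.h>

#define MAX_POLLS 2000000   /* give up after this many MPI_Test calls */

int main(int argc, char **argv)
{
    int rank, buf = 0, flag = 0;
    long polls = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Post a non-blocking receive so it can be polled, instead of
         * blocking forever in MPI_Recv when the sender's machine dies. */
        MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);

        while (!flag && polls < MAX_POLLS) {
            MPI_Test(&req, &flag, &status);
            polls++;
        }

        if (!flag) {
            /* Peer presumed dead: give up and abort the whole job. */
            fprintf(stderr, "waited # %ld times - aborting\n", polls);
            MPI_Cancel(&req);
            MPI_Abort(MPI_COMM_WORLD, 5);
        }
    } else {
        /* Workers send a result back to rank 0. */
        MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The expectation is that the MPI_Abort tears the whole job down; as the log below shows, that is not what happens.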
My requirement is that either all processes complete or mpirun exits with an error, no matter where the processes are in their execution when a failure occurs.  The system must survive a machine dying: the current job fails, then it regroups and re-casts the job over the remaining nodes.

I am running FC10, gcc 4.3.2, and Open MPI 1.4.1
on dual-core Intel x86_64 machines with 4 GB RAM.

===============================================================================================================
The commands I have tried:
mpirun  -hostfile ~/mpd.hosts -np 6  ./ingsprinkle  test t3  "select * from tab"  

mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6  ./ingsprinkle  test t3  "select * from tab"   

mpirun -mca orte_forward_job_control 1  -hostfile ~/mpd.hosts -np 6  ./ingsprinkle  test t3  "select * from tab"   

===============================================================================================================

The results:
recv returned 0 with status 0
waited  # 2000002 times - now status is  0 flag is -1976147192
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29141 on
node bd01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

[*** wait a long time ***]
[bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)

^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

===============================================================================================================

As you can see, my trap can signal an abort and the TCP layer can time out, but mpirun just keeps on running...

Any help greatly appreciated.
Vlad