I had to think about this for a while, and chatted briefly about it with Jeff, who concurred with my concerns.

We don't believe this is going to work for you within MPI. The problem is that our error recovery procedures for MPI won't avoid losing the current query. Basically, all you can do is (a) periodically checkpoint your application, and (b) when a failure occurs, return the entire system back to the last checkpoint and start over again from there.

Thus, you will always lose the current query, plus anything that happened since the last checkpoint.
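
As a rough illustration of that checkpoint-and-rollback cycle, here is a minimal sketch in plain C. It is not Open MPI's checkpoint/restart machinery: the `app_state` struct, the function names, and the checkpoint path are all hypothetical, and it assumes POSIX `rename()` semantics (an existing file is replaced atomically).

```c
#include <stdio.h>

/* Hypothetical application state; a real engine would hold far more. */
typedef struct {
    long step;   /* number of work units completed so far */
    long total;  /* running result of those work units */
} app_state;

/* Write the state to a temporary file, then rename it over the previous
 * checkpoint, so a crash in mid-write never leaves a torn checkpoint. */
int save_checkpoint(const char *path, const app_state *s)
{
    char tmp[512];
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;
    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || n != 1)
        return -1;
    return rename(tmp, path);
}

/* Roll back: overwrite the in-memory state with the last checkpoint. */
int load_checkpoint(const char *path, app_state *s)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Do work until `stop_at` units are done, checkpointing every
 * `interval` units. Returns 0 on success, -1 if a checkpoint fails. */
int run_until(const char *path, app_state *s, long stop_at, long interval)
{
    while (s->step < stop_at) {
        s->total += s->step;  /* stand-in for one unit of real work */
        s->step++;
        if (s->step % interval == 0 && save_checkpoint(path, s) != 0)
            return -1;
    }
    return 0;
}
```

If a run reaches step 10 with an interval of 4 and then fails, `load_checkpoint()` brings the state back to step 8: the work done after the last checkpoint is gone, which is the loss described above.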

I'm not sure why you are using MPI in an SQL engine - are you just trying to run parallel copies of the engine? If so, then you might want to consider doing something more along the lines of the OpenRCM project (a sub-project of OMPI).

ORCM was designed to run multiple copies of applications in parallel, each receiving the same input, so that the failure of any application (or the node it is on) is invisible to any consumer of that application's output. We run databases with it now, as well as control applications where we cannot tolerate any downtime due to a node failure.

We are working hard right now to get out the first production release of ORCM. If you think it might be of use, you are welcome to try it - we are pretty responsive to bugs and/or feature requests.

More about ORCM can be found at http://www.open-mpi.org/projects/orcm. To understand more about how it works, look at the presentation

http://www.open-mpi.org/projects/orcm/papers/cisco-2010

HTH
Ralph

On Jun 23, 2010, at 8:45 AM, Randolph Pullen wrote:

It would be excellent if you could address this in 1.4.x or provide an alternative, as it is an important attribute in fault recovery, particularly with a large number of nodes, where the MTBF is significantly lowered; i.e. we can expect node failures from time to time.

A bit of background:
I am building a parallel SQL engine for large scale analytics and need to re-map failed nodes to a suitable backup data set, without losing the currently running query.
I am assuming this means re-starting mpirun with adjusted parameters but it may be possible (although probably very messy) to re-start failed processes on backup nodes without losing the current query.

What are your thoughts?

Regards,
Randolph

PS: excellent product, keep up the good work
--- On Thu, 24/6/10, Ralph Castain <rhc@open-mpi.org> wrote:

From: Ralph Castain <rhc@open-mpi.org>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <users@open-mpi.org>
Received: Thursday, 24 June, 2010, 12:00 AM

mpirun is not an MPI process, so it makes no difference what your processes are doing wrt MPI_Abort or any other MPI function call.

A quick glance thru the code shows that mpirun won't properly terminate under these conditions. It is waiting to hear that all daemons have terminated, and obviously is missing the one that was on the node that you powered off.

This obviously isn't a scenario we regularly test. The work Jeff referred to is intended to better handle such situations, but isn't ready for release yet. I'm not sure if I'll have time to go back to the 1.4 series and resolve this behavior, but I'll put it on my list of things to look at if/when time permits.


On Jun 23, 2010, at 6:53 AM, Randolph Pullen wrote:

ok,
Having confirmed that replacing MPI_Abort with exit() does not work, and having checked that under these conditions the only process left running appears to be mpirun,
I think I need to report a bug, i.e.:
Although the processes themselves can be stopped (by exit() if nothing else),
mpirun hangs after a node is powered off and can never exit, as it appears to wait indefinitely for the missing node to receive or send a signal.


--- On Wed, 23/6/10, Jeff Squyres <jsquyres@cisco.com> wrote:

From: Jeff Squyres <jsquyres@cisco.com>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <users@open-mpi.org>
Received: Wednesday, 23 June, 2010, 9:10 PM

Open MPI's fault tolerance support is fairly rudimentary.  If you kill any process without calling MPI_Finalize, Open MPI will -- by default -- kill all the others in the job.

Various research work is ongoing to improve fault tolerance in Open MPI, but I don't know the state of it in terms of surviving a failed process.  I *think* that this kind of stuff is not ready for prime time, but I admit that this is not an area that I pay close attention to.



On Jun 23, 2010, at 3:08 AM, Randolph Pullen wrote:

> That is effectively what I have done by changing to the immediate send/receive and waiting in a loop a finite number of times for the transfers to complete - and calling MPI_Abort if they do not complete in a set time.
> It is not clear how I can kill mpirun in a manner consistent with the API.
> Are you implying I should call exit() rather than MPI_Abort?
>
> --- On Wed, 23/6/10, David Zhang <solarbikedz@gmail.com> wrote:
>
> From: David Zhang <solarbikedz@gmail.com>
> Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
> To: "Open MPI Users" <users@open-mpi.org>
> Received: Wednesday, 23 June, 2010, 4:37 PM
>
> Since you turned the machine off instead of just killing one of the processes, no signals could be sent to other processes.  Perhaps you could institute some sort of handshaking in your software that periodically checks for the attendance of all machines, and times out if not all are present within some allotted time?
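
The timeout idea suggested above could be sketched roughly as follows. This is plain C with no MPI dependency, so it can be tried anywhere: `poll_fn`, `wait_with_deadline`, and the two stand-in callbacks are hypothetical names. In a real program the callback would presumably wrap `MPI_Test` on the pending request, and `MPI_Wtime` could replace `time()`. Note that it only detects the stall; as the rest of this thread shows, it does not by itself make mpirun exit when a node has been powered off.

```c
#include <time.h>

/* Hypothetical callback: returns nonzero once the pending operation
 * has completed. With MPI this would wrap MPI_Test on the request. */
typedef int (*poll_fn)(void *ctx);

/* Poll until the operation completes or `timeout_s` seconds of wall
 * clock time have elapsed. Returns 0 on completion, -1 on timeout.
 * This busy-waits; a real program might sleep briefly between polls. */
int wait_with_deadline(poll_fn poll, void *ctx, double timeout_s)
{
    time_t start = time(NULL);
    for (;;) {
        if (poll(ctx))
            return 0;
        if (difftime(time(NULL), start) >= timeout_s)
            return -1;
    }
}

/* Stand-in for a transfer that completes after a fixed number of polls. */
int countdown_poll(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0)
        (*remaining)--;
    return *remaining == 0;
}

/* Stand-in for a peer that has gone silent and never completes. */
int never_ready(void *ctx)
{
    (void)ctx;
    return 0;
}
```

Bounding the wait by elapsed time rather than by a fixed number of loop iterations keeps the timeout independent of how fast the polling loop happens to spin.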
>
> On Tue, Jun 22, 2010 at 10:43 PM, Randolph Pullen <randolph_pullen@yahoo.com.au> wrote:
>
> I have an MPI program that aggregates data from multiple SQL systems.  It all runs fine.  To test fault tolerance I switch one of the machines off while it is running.  The result is always a hang, i.e. mpirun never completes.
>
> To try and avoid this I have replaced the send and receive calls with immediate calls (i.e. MPI_Isend, MPI_Irecv) to try and trap long waiting sends and receives but it makes no difference.
> My requirement is that all processes complete or mpirun exits with an error - no matter where they are in their execution when a failure occurs.  This system must continue (i.e. fail) if a machine dies, regroup and re-cast the job over the remaining nodes.
>
> I am running FC10, gcc 4.3.2 and Open MPI 1.4.1
> 4G RAM, dual core intel all x86_64
>
>
> ===============================================================================================================
> The commands I have tried:
> mpirun  -hostfile ~/mpd.hosts -np 6  ./ingsprinkle  test t3  "select * from tab"   
>
> mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6  ./ingsprinkle  test t3  "select * from tab"   
>
> mpirun -mca orte_forward_job_control 1  -hostfile ~/mpd.hosts -np 6  ./ingsprinkle  test t3  "select * from tab"   
>
>
> ===============================================================================================================
>
> The results:
> recv returned 0 with status 0
> waited  # 2000002 tiumes - now status is  0 flag is -1976147192
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 5.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 29141 on
> node bd01 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> [*** wait a long time ***]
> [bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>
> ^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
>
> ===============================================================================================================
>
> As you can see, my trap can signal an abort, the TCP layer can time out but mpirun just keeps on running...
>
> Any help greatly appreciated..
> Vlad
>
>
>

> _______________________________________________
> users mailing list
> users@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> David Zhang
> University of California, San Diego


--
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

