Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Proper way to throw an error to all nodes?
From: David Singleton (David.Singleton_at_[hidden])
Date: 2008-06-03 23:00:01

This is exactly what MPI_Abort is for.


Terry Frankcombe wrote:
> Calling MPI_Finalize in a single process won't ever do what you want.
> You need to get all the processes to call MPI_Finalize for the end to be
> graceful.
>
> What you need to do is have some sort of special message to tell
> everyone to die. In my codes I have a rather dynamic master-slave model
> with flags being broadcast by the master process to tell the slaves what
> to expect next, so it's easy for me to send out an "it's all over,
> please kill yourself" message. For a more rigid communication pattern
> you could embed the die message in the data: something like if the first
> element of the received data is negative, then that's the sign things
> have gone south and everyone should stop what they're doing and
> MPI_Finalize. The details depend on the details of your code.
>
> Presumably you could also set something up using tags and message
> polling.
>
> Hope this helps.
>
> On Tue, 2008-06-03 at 19:57 +0900, 8mj6tc902_at_[hidden] wrote:
>> So I'm working on this program which has many ways it might possibly die
>> at runtime, but one of them that happens frequently is the user types a
>> wrong (nonexistent) filename at the command prompt. As it is now, the
>> node looking for the file notices the file doesn't exist and tries to
>> terminate the program. It tries to call MPI_Finalize(), but the other
>> nodes are all waiting for a message from the node doing the file
>> reading, so MPI_Finalize waits forever until the user realizes the job
>> isn't doing anything and terminates it manually.
>>
>> So, my question is: what's the "correct" graceful way to handle
>> situations like this? Is there some MPI function which can basically
>> throw an exception to all other nodes telling them to bail out now? Or is
>> the correct behaviour just to have the node that spotted the error die
>> quietly and wait for the others to notice?
>>
>> Thanks for any suggestions!