
Open MPI User's Mailing List Archives


Subject: [OMPI users] mpirun (orte ?) not shutting down cleanly on job aborts
From: Bill Johnstone (beejstone3_at_[hidden])
Date: 2008-06-09 10:50:01


Hello OMPI devs,

I'm currently running OMPI v1.2.4. None of the bugs fixed in 1.2.5 and 1.2.6 seemed to affect me or my users, so I haven't upgraded yet.

When I was first getting started with Open MPI, I ran into several problems. I was able to solve most of them, but one remains. As I mentioned in
http://www.open-mpi.org/community/lists/users/2007/07/3716.php

when any process in an MPI job exits non-gracefully, mpirun hangs. As an example, one code I run dies on trivial runtime errors (e.g., a small mistake in the input file) and prints messages to the screen like:

[node1.x86-64:28556] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 16

but mpirun never exits, and Ctrl+C won't kill it. I have to resort to kill -9.
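
For illustration, the manual escalation described above (Ctrl+C first, kill -9 as a last resort) could be automated with a small wrapper around the launcher. This is only a minimal sketch, not part of Open MPI or SLURM: `run_with_watchdog`, `main`, and their parameters are hypothetical names, and it assumes a fixed wall-clock limit is an acceptable stand-in for detecting a hang. It does nothing to clear job state on the SLURM controller.

```python
import signal
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s, grace_s=5.0):
    """Run cmd; if it outlives timeout_s, escalate SIGINT -> SIGTERM -> SIGKILL.

    Returns (returncode, signal_that_ended_it), where the second element is
    None if the command exited on its own within timeout_s.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal case: the command finishes (or aborts) by itself.
        return proc.wait(timeout=timeout_s), None
    except subprocess.TimeoutExpired:
        pass

    # Assumed hung: try progressively harsher signals, giving each
    # one grace_s seconds to take effect before moving on.
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        proc.send_signal(sig)
        try:
            return proc.wait(timeout=grace_s), sig
        except subprocess.TimeoutExpired:
            continue

    # SIGKILL cannot be caught or ignored, so block until the process is reaped.
    return proc.wait(), signal.SIGKILL


def main(argv):
    # Hypothetical command line: watchdog.py <timeout_seconds> <command...>
    rc, sig = run_with_watchdog(argv[2:], timeout_s=float(argv[1]))
    if sig is not None:
        print(f"watchdog: command had to be stopped with {sig.name}", file=sys.stderr)
    return rc
```

Calling something like `main(["watchdog.py", "3600", "mpirun", ...])` would give the launcher an hour before the escalation starts; the limit and the mpirun arguments depend entirely on the job.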

Now that I'm running under SLURM, this is worse because there is no nice way to manually clear individual jobs off the controller. So even if I manually kill mpirun on the failed job, slurmctld still thinks it's running.

Ralph Castain replied to the previously linked message:
http://www.open-mpi.org/community/lists/users/2007/07/3718.php indicating that he thought he knew why this was happening, and that it either was already fixed in the trunk or likely would be.

At this point, I just want to know: can I look forward to this being fixed in the upcoming v1.3 series?

I don't mean that to sound ungrateful: *many thanks* to the OMPI devs for what you've already given the community at large. I'm just a bit frustrated because we seem to run a lot of codes on our cluster that abort at one time or another.

Thank you.