Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpirun (orte ?) not shutting down cleanly on job aborts
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-21 09:26:36

Sorry for the delay in replying to this -- mails sometimes pile up in
my INBOX and I don't get to reply to them all in a timely fashion.

Yes, you can expect this to be much better in the v1.3 series. If you
have a few cycles, you might want to test a nightly trunk tarball
snapshot in some of your problematic cases and see if it's better.
We've had a little instability in trunk tarballs over the last week,
so you might want to wait until next week to give it a shot.
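
For anyone who wants to poke at this, here is a minimal sketch of a
program that aborts in the way Bill describes below (the file name,
the "bad input" message, and the error code 16 are just illustrative,
not his actual application):

/* abort_test.c: rank 0 simulates a trivial runtime error and calls
 * MPI_Abort; the other ranks sit in a barrier.  The question is
 * whether mpirun then cleans everything up and exits. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Pretend we found a small mistake in the input file. */
        fprintf(stderr, "bad input file, aborting\n");
        MPI_Abort(MPI_COMM_WORLD, 16);
    }

    /* Remaining ranks wait here; on a build showing the problem,
     * mpirun prints the MPI_ABORT message but never terminates
     * the job. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Build it with mpicc and launch it with something like "mpirun -np 4
./abort_test"; a fixed runtime should tear down all the ranks and
exit with an error, rather than leaving mpirun hanging.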

On Jun 9, 2008, at 10:50 AM, Bill Johnstone wrote:

> Hello OMPI devs,
> I'm currently running OMPI v1.2.4. It didn't seem that any bugs
> which affect me or my users were fixed in 1.2.5 or 1.2.6, so I
> haven't upgraded yet.
> When I was initially getting started with Open MPI, I had some
> problems which I was able to solve, but one still remains. As I
> mentioned in a previous message, when there is a non-graceful exit
> on any of the MPI jobs, mpirun hangs. As an example, I have a code
> that I run which, when it hits a trivial runtime error (e.g., some
> small mistake in the input file), dies yielding messages to the
> screen like:
> [node1.x86-64:28556] MPI_ABORT invoked on rank 0 in communicator
> MPI_COMM_WORLD with errorcode 16
> but mpirun never exits, and Ctrl+C won't kill it. I have to resort
> to kill -9.
> Now that I'm running under SLURM, this is worse because there is no
> nice way to manually clear individual jobs off the controller. So
> even if I manually kill mpirun on the failed job, slurmctld still
> thinks it's running.
> Ralph Castain replied to the previously-linked message, indicating
> that he thought he knew why this was happening and that it was, or
> would likely be, fixed in the trunk.
> At this point, I just want to know: can I look forward to this
> being fixed in the upcoming v1.3 series?
> I don't mean that to sound ungrateful: *many thanks* to the OMPI
> devs for what you've already given the community at large. I'm just
> a bit frustrated because we seem to run a lot of codes on our
> cluster that abort at one time or another.
> Thank you.
> _______________________________________________
> users mailing list
> users_at_[hidden]

Jeff Squyres
Cisco Systems