Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Occasional mpirun hang on completion
From: Barry Rountree (rountree_at_[hidden])
Date: 2008-01-18 22:13:23


On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> Barry --
>
> Could you check what apps are still running when it hangs? I.e., I
> assume that all the uptime's are dead; are all the orted's dead on the
> remote nodes? (orted = our helper process that is launched on the
> remote nodes to exert process control, funnel I/O back and forth to
> mpirun, etc.)
>
> If any of the orted's are still running, can you connect to them with
> gdb and get a backtrace to see where they are hung?
>
> Likewise, can you connect to mpirun with gdb and get a backtrace of
> where it's hung?
>
> Ralph, the main ORTE developer, is pretty sure that it's stuck in the
> IO flushing routines that are executed at the end of time (look for
> function names like iof_flush or similar). We thought we had fixed
> all of those on the 1.2 branch, but perhaps there's some other weird
> race condition happening that doesn't happen on our test machines...
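
A minimal sketch of how the requested backtraces might be collected without an interactive gdb session. It assumes `gdb` and `pgrep` are installed on the node and that you have permission to attach to the processes; the helper names (`gdb_backtrace_cmd`, `find_pids`, `backtraces`) are made up for illustration and are not part of Open MPI.

```python
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a gdb command line that attaches to pid, prints every
    thread's backtrace, and exits (batch mode, no interactive prompt)."""
    return ["gdb", "-batch", "-ex", "thread apply all bt", "-p", str(pid)]


def find_pids(name):
    """Return the PIDs of processes whose name is exactly `name`.
    Assumes pgrep is available on this node."""
    result = subprocess.run(
        ["pgrep", "-x", name], capture_output=True, text=True, check=False
    )
    return [int(tok) for tok in result.stdout.split()]


def backtraces(name):
    """Map each matching PID to the text gdb printed for it."""
    traces = {}
    for pid in find_pids(name):
        result = subprocess.run(
            gdb_backtrace_cmd(pid), capture_output=True, text=True, check=False
        )
        traces[pid] = result.stdout
    return traces
```

Calling `backtraces("mpirun")` on the launch node and `backtraces("orted")` on each remote node while the job is hung should produce output that can be searched for names like iof_flush.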

I'm happy to help. I've got a paper submission deadline on Tuesday, so
it might not be until midweek.

Thanks for the reply,

Barry

>
>
>
> On Jan 13, 2008, at 10:17 AM, Barry Rountree wrote:
>
> > On Sun, Jan 13, 2008 at 09:54:47AM -0500, Barry Rountree wrote:
> > > Hello,
> > >
> > > The following command
> > >
> > > mpirun -np 2 -hostfile ~/hostfile uptime
> > >
> > > will occasionally hang after completing. The expected output appears on
> > > the screen, but mpirun needs a SIGKILL to return to the console.
> > >
> > > This has been verified with OpenMPI v1.2.4 compiled with both icc 9.1
> > > 20061101 (aka 9.1.045) and gcc 4.1.0 20060304 (aka Red Hat 4.1.0-3). I
> > > have also tried earlier versions of OpenMPI and found the same bug
> > > (1.1.2 and 1.2.2).
> > >
> > > Using -verbose didn't provide any additional output. I'm happy to help
> > > tracking down whatever is causing this.
> >
> > A couple more data points:
> >
> > mpirun -np 4 -hostfile ~/hostfile --no-daemonize uptime
> >
> > hung twice over 100 runs. Without the --no-daemonize, the command hung
> > 16 times over 100 runs. (This is using the version compiled with icc.)
> >
> > Barry
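
For anyone who wants to gather comparable counts, here is a rough sketch of a hang counter. It is not the script used for the numbers above. The function name `count_hangs` and the 30-second timeout are assumptions: any run still alive after the timeout is treated as hung and killed.

```python
import subprocess


def count_hangs(cmd, runs=100, timeout_s=30.0):
    """Run cmd `runs` times and return how many runs failed to exit
    within timeout_s seconds. A run that times out is counted as a hang;
    subprocess.run kills the child before raising TimeoutExpired."""
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
                check=False,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs
```

A call might look like `count_hangs(["mpirun", "-np", "4", "-hostfile", os.path.expanduser("~/hostfile"), "uptime"])`, with `os` imported first because no shell is involved to expand `~`. Only mpirun itself is killed on timeout, so any orted left behind on the remote nodes would need to be cleaned up separately.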
> >
> > >
> > > Many thanks,
> > >
> > > Barry Rountree
> > > Ph.D. Candidate, Computer Science
> > > University of Georgia
> > >
> > > _______________________________________________
> > > users mailing list
> > > users_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>