Could you check which processes are still running when it hangs? I.e., I
assume that all the uptime processes are dead; are all the orteds dead on
the remote nodes? (orted = our helper process that is launched on the
remote nodes to exert process control, funnel I/O back and forth to
If any of the orted's are still running, can you connect to them with
gdb and get a backtrace to see where they are hung?
Likewise, can you connect to mpirun with gdb and get a backtrace of
where it's hung?
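
In case it helps, here is a rough sketch of one way to collect those
backtraces non-interactively. It assumes gdb and pgrep are installed on
each node and that you run it as the user who owns the processes;
dump_bt is just a hypothetical helper name, not part of Open MPI.

```shell
#!/bin/sh
# Sketch: dump backtraces from any lingering orted/mpirun processes.
# Run on every node involved in the hung job (mpirun lives on the head node).

dump_bt() {
    # $1 = exact process name to look for
    for pid in $(pgrep -x "$1"); do
        echo "=== $1 pid $pid ==="
        # -p attaches to the running PID; -batch exits after the -ex command
        gdb -batch -p "$pid" -ex "thread apply all bt" 2>&1
    done
}

dump_bt orted
dump_bt mpirun
```

If nothing is printed for orted on a node, no orted was found there,
which is itself useful to know.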
Ralph, the main ORTE developer, is pretty sure that it's stuck in the
IO flushing routines that are executed at the end of time (look for
function names like iof_flush or similar). We thought we had fixed
all of those on the 1.2 branch, but perhaps there's some other weird
race condition happening that doesn't happen on our test machines...
On Jan 13, 2008, at 10:17 AM, Barry Rountree wrote:
> On Sun, Jan 13, 2008 at 09:54:47AM -0500, Barry Rountree wrote:
> > Hello,
> > The following command
> > mpirun -np 2 -hostfile ~/hostfile uptime
> > will occasionally hang after completing. The expected output appears on
> > the screen, but mpirun needs a SIGKILL to return to the console.
> > This has been verified with OpenMPI v1.2.4 compiled with both icc
> > 20061101 (aka 9.1.045) and gcc 4.1.0 20060304 (aka Red Hat 4.1.0-3). I
> > have also tried earlier versions of OpenMPI and found the same bug
> > (1.1.2 and 1.2.2).
> > Using -verbose didn't provide any additional output. I'm happy to help
> > track down whatever is causing this.
> A couple more data points:
> mpirun -np 4 -hostfile ~/hostfile --no-daemonize uptime
> hung twice over 100 runs. Without the --no-daemonize, the command hung
> 16 times over 100 runs. (This is using the version compiled with
> > Many thanks,
> > Barry Rountree
> > Ph.D. Candidate, Computer Science
> > University of Georgia
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users