Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Occasional mpirun hang on completion
From: Barry Rountree (rountree_at_[hidden])
Date: 2008-01-24 03:25:56


On Thu, Jan 24, 2008 at 03:01:40AM -0500, Barry Rountree wrote:
> On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> > Barry --
> >
> > Could you check what apps are still running when it hangs? I.e., I
> > assume that all the uptime's are dead; are all the orted's dead on the
> > remote nodes? (orted = our helper process that is launched on the
> > remote nodes to exert process control, funnel I/O back and forth to
> > mpirun, etc.)

One more bit of trivia -- when I ran my killall script across the nodes,
four of the sixteen still had an orted process hanging around.
If this is a synchronization problem, then most of the nodes are
handling it fine.

>
> Here's the stack trace of the orted process on node 01. The "uname"
> process was long gone (and had sent its output back with no difficulty).
>
> ============
> Stopping process localhost:5321 (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted).
> Thread received signal INT
> stopped at [<opaque> pthread_cond_wait@@GLIBC_2.3.2(...) 0x00002aaaab67a766]
> (idb) where
> >0 0x00002aaaab67a766 in pthread_cond_wait@@GLIBC_2.3.2(...) in /lib64/libpthread-2.4.so
> #1 0x0000000000401fef in opal_condition_wait(c=0x5075c0, m=0x507580) "../../../opal/threads/condition.h":64
> #2 0x0000000000403000 in main(argc=17, argv=0x7ffffd82cd38) "orted.c":525
> #3 0x00002aaaab7a6e54 in __libc_start_main(...) in /lib64/libc-2.4.so
> #4 0x0000000000401c19 in _start(...) in /osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orted
> ============
>
> The mpirun process on the root node isn't quite as useful.
>
>
> ============
> Stopping process localhost:29856 (/osr/users/rountree/ompi-1.2.4_intel_threaded_debug/bin/orterun).
> Thread received signal INT
> stopped at [<opaque> poll(...) 0x00000039ef2c3806]
> (idb) where
> >0 0x00000039ef2c3806 in poll(...) in /lib64/libc-2.4.so
> #1 0x0000000040a000c0
> ============
>
> Let me know what other information would be helpful.
>
> Best,
>
> Barry