Open MPI User's Mailing List Archives

From: Code Master (cpp.codemaster_at_[hidden])
Date: 2007-06-17 05:27:24


Hi!

I've just tried out openmpi-1.2.3-rc1. My client programs run
successfully when nproc < 16. However, when the number of nodes is >= 16,
mpirun hangs (on the master only) at the end of the execution, after all
processes (including the client program itself and the orted daemons) have exited.

I then ran ps x on the master node and found that mpirun is the only
remaining entry. Apparently it is in the sleeping state (S+).

Does this give any more hints about what went wrong?

Thanks!

On 6/12/07, Ralph H Castain <rhc_at_[hidden]> wrote:
> Hi there
>
> Sorry for the delayed response - I was tied up this weekend and almost
> completely away from the computer. Doesn't happen very often (probably not
> often enough! ;-) )
>
> I can only think of one thing you could try with 1.2.2. I note that you have
> enabled MPI threads and progress threads. Do you really need the threading
> capabilities? If you can possibly live without them, at least for a trial,
> then I would re-configure with --without-threads --disable-mpi-threads
> --disable-progress-threads.
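> 
> For example, something along these lines (just a sketch of your configure
> line from below, with only the threading flags changed):
> 
>   ./configure CFLAGS="-g -pg -O3" \
>     --prefix=/home/foo/490_research/490/src/mpi.optimized_profiling/ \
>     --without-threads --disable-mpi-threads --disable-progress-threads \
>     --enable-static --disable-shared --without-memory-manager \
>     --without-libnuma --disable-mpi-f77 --disable-mpi-f90 \
>     --disable-mpi-cxx --disable-mpi-cxx-seek --disable-dlopen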
>
> Our threading support is really not that great just yet, so it is entirely
> possible that you are hitting some kind of thread-lock condition.
> Unfortunately, it is impossible to tell at this point, though we hopefully
> will have some new diagnostics shortly that will help us developers debug
> such situations.
>
> I did recently introduce some major changes to the system that *might*
> affect this behavior. However, those are only in our subversion trunk and
> will never be moved to the 1.2 code series - they will be released with the
> 1.3 series sometime late this year/early next year. If you would like, you
> can check out the trunk and try your code with that version to see if you get
> some improved behavior.
>
> Hope that is of some help. Let me know what you see and I'll try to help you
> out.
>
> Ralph
>
>
> On 6/11/07 4:02 AM, "Code Master" <cpp.codemaster_at_[hidden]> wrote:
>
> > Hi Ralph,
> >
> > I'm using openmpi-1.2.2 to compile and run my client app. After my
> > app and the orted processes exit successfully on all master and slave
> > nodes, mpirun hangs on the master node (mpirun has also exited
> > successfully on all slave nodes).
> >
> > This only happens in openmpi-1.2.2.
> >
> > Can you see why this is happening? (I've included the ./configure
> > script in the records below) Also would you please help me out? I
> > really need to get mpirun working in order to benchmark my parallel
> > programs for my dissertation.
> >
> > Thanks!
> >
> > ---------- Forwarded message ----------
> > From: Code Master <cpp.codemaster_at_[hidden]>
> > Date: Jun 9, 2007 9:44 AM
> > Subject: Re: [OMPI users] mpirun in openmpi-1.2.2 fails to exit after
> > client program finishes
> > To: Open MPI Users <users_at_[hidden]>
> >
> >
> > On 6/9/07, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> >> On Jun 8, 2007, at 9:29 AM, Code Master wrote:
> >>
> >>> I compiled openmpi-1.2.2 with:
> >>>
> >>> ./configure CFLAGS="-g -pg -O3" \
> >>>   --prefix=/home/foo/490_research/490/src/mpi.optimized_profiling/ \
> >>>   --enable-mpi-threads --enable-progress-threads --enable-static \
> >>>   --disable-shared --without-memory-manager \
> >>>   --without-libnuma --disable-mpi-f77 --disable-mpi-f90 \
> >>>   --disable-mpi-cxx --disable-mpi-cxx-seek --disable-dlopen
> >>>
> >>> (Thanks Jeff, now I know that I have to add --without-memory-manager
> >>> and --without-libnuma for static linking.)
> >>
> >> Good.
> >>
> >>> make all
> >>> make install
> >>>
> >>> then I run my client app with:
> >>>
> >>> /home/foo/490_research/490/src/mpi.optimized_profiling/bin/mpirun \
> >>>   --hostfile ../hostfile -n 32 raytrace -finputs/car.env
> >>>
> >>> The program runs well and each process completes successfully (I can
> >>> tell because all processes have generated gmon.out successfully, and a
> >>> "ps aux" on the slave nodes (i.e. other than the originating node)
> >>> shows that my program has already exited there). Therefore I think
> >>> this may have something to do with mpirun, which hangs forever.
> >>
> >> Be aware that you may have problems with multiple processes writing
> >> to the same gmon.out, unless you're running each instance in a
> >> different directory (your command line doesn't indicate that you are,
> >> but that doesn't necessarily prove anything).
> >
> > I am sure this is not happening, because in my program, right after the
> > MPI initialization, main() invokes chdir() to change into the process's
> > own directory (named after the proc_id). Therefore all processes have
> > their own directory to write to.
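> >
> > Roughly like this (a minimal sketch of the idea, not my exact code; the
> > directory naming is just illustrative):
> >
> >   #include <mpi.h>
> >   #include <stdio.h>
> >   #include <sys/stat.h>
> >   #include <unistd.h>
> >
> >   int main(int argc, char **argv)
> >   {
> >       int rank;
> >       MPI_Init(&argc, &argv);
> >       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >       /* give each rank its own working directory so the gmon.out
> >          files do not clobber each other */
> >       char dir[64];
> >       snprintf(dir, sizeof(dir), "proc_%d", rank);
> >       mkdir(dir, 0755);          /* ignore EEXIST on reruns */
> >       chdir(dir);
> >
> >       /* ... actual raytrace work happens here ... */
> >
> >       MPI_Finalize();
> >       return 0;
> >   }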
> >
> >>> Can you see anything wrong in my ./configure command which explains
> >>> the mpirun hang at the end of the run? How can I fix it?
> >>
> >> No, everything looks fine.
> >>
> >> So you confirm that all raytrace instances have exited and all orteds
> >> have exited, leaving *only* mpirun running?
> >
> > Yes, I am sure that all raytrace instances as well as all MPI-related
> > processes (including mpirun, the orteds, etc.) have exited on all slave
> > nodes. On the *master* node, all raytrace instances and all orteds
> > have exited as well, leaving *only* mpirun running on the *master*
> > node.
> >
> >   14818 pts/0    S+   0:00 /home/foo/490_research/490/src/mpi.optimized_profiling/bin/mpirun --hostfile ../hostfile -n 32 raytrace -finputs/car.env -s 1
> >> There was a race condition about this at one point; Ralph -- can you
> >> comment further?
> >>
> >> --
> >> Jeff Squyres
> >> Cisco Systems
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
>
>
>