Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] possible bug in 1.3.2 sm transport
From: Bryan Lally (lally_at_[hidden])
Date: 2009-07-14 17:28:50


I was about to test 1.3.3rc2 when I saw that 1.3.3 had also escaped. I
tried it, and voilà! It solves the issue I reported in May, below.

Thanks for all the work that went into this.

        - Bryan

Bryan Lally, lally_at_[hidden]
Los Alamos National Laboratory
Los Alamos, New Mexico

Bryan Lally wrote:
> Developers,
> This is my first post to the openmpi developers list.  I think I've run 
> across a race condition in your latest release.  Since my demonstrator 
> is somewhat large and cumbersome, I'd like to know if you already know 
> about this issue before we start the process of providing code and details.
> Basics: Open MPI 1.3.2, Fedora 9, 2 x86_64 quad-core CPUs in one machine.
> Symptoms: our code hangs, always in the same vicinity, usually at the 
> same place, 10-25% of the time.  Sometimes more often, sometimes less.
> Our code has run reliably with many MPI implementations for years.  We 
> haven't added anything recently that is a likely culprit.  While we have 
> our own issues, this doesn't feel like one of ours.
> We see that there is new code in the shared memory transport between 
> 1.3.1 and 1.3.2.  Our code doesn't hang with 1.3.1 (nor 1.2.9).  Only 
> with 1.3.2.
> If we switch to tcp for transport (with mpirun --mca btl tcp,self ...) 
> we don't see any hangs.  Running with --mca btl sm,self results in hangs.
> If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the 
> problem, we no longer see hangs.
> We demonstrate this with 4 processes.  When we attach a debugger to the 
> hung processes, we see that the hang results from an MPI_Allreduce.  All 
> processes have made the same call to MPI_Allreduce.  The processes are 
> all in opal_progress, called (with intervening calls) by MPI_Allreduce.
> My question is, have you seen anything like this before?  If not, what 
> do we do next?
> Thanks.
>     - Bryan
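
For reference, the actual reproducer was never posted, but a minimal,
hypothetical sketch of the pattern described in the quoted report would
look roughly like this (the loop count, buffer contents, and program name
are invented for illustration):

/* Hypothetical sketch, not the actual LANL code.  Pattern from the report:
 * 4 ranks on one node, all making the same MPI_Allreduce call, which
 * occasionally hangs in opal_progress over the sm BTL in Open MPI 1.3.2.
 *
 * Shared memory (hangs reported):  mpirun -np 4 --mca btl sm,self  ./allreduce_hang
 * TCP (no hangs reported):         mpirun -np 4 --mca btl tcp,self ./allreduce_hang
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100000; i++) {
        local = (double)(rank + i);

        /* Every rank makes the same call; in the reported failure all four
         * ranks end up spinning in opal_progress underneath this call. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* Workaround noted in the report: a few MPI_Barrier calls near the
         * trouble spot made the hang disappear.  Left commented out here. */
        /* MPI_Barrier(MPI_COMM_WORLD); */
    }

    if (rank == 0)
        printf("completed %d iterations\n", i);

    MPI_Finalize();
    return 0;
}

The commented-out MPI_Barrier corresponds to the workaround mentioned in
the report; enabling it masks the hang but does not address the underlying
sm-transport race.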