Developers,
I was about to test 1.3.3rc2, then I saw that 1.3.3 had also escaped. I
tried it, and voila! It solves the issue I reported in May, below.
Thanks for all the work that went into this.
- Bryan
--
Bryan Lally, lally_at_[hidden]
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico
Bryan Lally wrote:
> Developers,
>
> This is my first post to the openmpi developers list. I think I've run
> across a race condition in your latest release. Since my demonstrator
> is somewhat large and cumbersome, I'd like to know if you already know
> about this issue before we start the process of providing code and details.
>
> Basics: Open MPI 1.3.2, Fedora 9, two x86_64 quad-core CPUs in one machine.
>
> Symptoms: our code hangs, always in the same vicinity, usually at the
> same place, 10-25% of the time. Sometimes more often, sometimes less.
>
> Our code has run reliably with many MPI implementations for years. We
> haven't added anything recently that is a likely culprit. While we have
> our own issues, this doesn't feel like one of ours.
>
> We see that there is new code in the shared memory transport between
> 1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (nor 1.2.9). Only
> with 1.3.2.
>
> If we switch to tcp for transport (with mpirun --mca btl tcp,self ...)
> we don't see any hangs. Running using --mca btl sm,self results in hangs.
>
> If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the
> problem, we no longer see hangs.
>
> We demonstrate this with 4 processes. When we attach a debugger to the
> hung processes, we see that the hang results from an MPI_Allreduce. All
> processes have made the same call to MPI_Allreduce. The processes are
> all in opal_progress, called (with intervening calls) by MPI_Allreduce.
>
> My question is, have you seen anything like this before? If not, what
> do we do next?
>
> Thanks.
>
> - Bryan
>
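
For anyone wanting to repeat the transport comparison described in the quoted message, the two runs might look roughly like the following. The --mca btl values are taken from the message itself; the executable name ./app is only a placeholder, and -np 4 simply matches the four processes mentioned above.

```shell
# Shared-memory transport: the configuration reported to hang with 1.3.2
mpirun -np 4 --mca btl sm,self ./app

# TCP transport: the configuration reported not to hang
mpirun -np 4 --mca btl tcp,self ./app
```

Because the hang was reported only 10-25% of the time, each variant would likely need to be run repeatedly before drawing any conclusion.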