Eugene Loh wrote:
> Ralph Castain wrote:
>> Hi Bryan
>> I have seen similar issues on LANL clusters when message sizes were
>> fairly large. How big are your buffers when you call Allreduce? Can
>> you send us your Allreduce call params (e.g., the reduce operation,
>> datatype, num elements)?
>> If you don't want to send that to the list, you can send it to me at
> I haven't seen any updates on this. Please tell me Bryan sent info to
> Ralph at LANL and Ralph nailed this one. Please! :^)
I've got mostly good news ...
Ralph sent me a platform file and a corresponding .conf file. I built
ompi from openmpi-1.3.3a1r21223.tar.gz, with these files. I've been
running my normal tests and have been unable to hang a job yet. I've
run enough that I don't expect to see a problem.
So we're up and running, but with some extra voodoo in the platform
files. This is on a totally vanilla Fedora 9 installation (other than a
couple of Fortran compilers, but we're not using the Fortran interface
to mpi), running on a Dell workstation with 2 quad-core CPUs - vanilla
hardware, too. MPI isn't working out of the box.
From a user's perspective, configure should be setting the right
defaults on such a setup. But the core code seems to be working - I'm
giving it a good hammering.
The allreduces in question were doing a logical OR on 1 integer from
each process - it was an error check. Hence the buffers (on the
application side) were 4 bytes. There were only 4 processes involved.
Bryan Lally, lally_at_[hidden]
Los Alamos National Laboratory
Los Alamos, New Mexico