Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] possible bug in 1.3.2 sm transport
From: Bryan Lally (lally_at_[hidden])
Date: 2009-05-18 23:49:46


Eugene Loh wrote:
> Ralph Castain wrote:
>
>> Hi Bryan
>>
>> I have seen similar issues on LANL clusters when message sizes were
>> fairly large. How big are your buffers when you call Allreduce? Can
>> you send us your Allreduce call params (e.g., the reduce operation,
>> datatype, num elements)?
>>
>> If you don't want to send that to the list, you can send it to me at
>> LANL.
>
> I haven't seen any updates on this. Please tell me Bryan sent info to
> Ralph at LANL and Ralph nailed this one. Please! :^)

Eugene,

I've got mostly good news ...

Ralph sent me a platform file and a corresponding .conf file. I built
ompi from openmpi-1.3.3a1r21223.tar.gz using these files. I've been
running my normal tests and have been unable to hang a job yet. I've
run enough that I don't expect to see a problem.

So we're up and running, but with some extra voodoo in the platform
files. This is on a totally vanilla Fedora 9 installation (other than a
couple of Fortran compilers, but we're not using the Fortran interface
to MPI), running on a Dell workstation with 2 quad-core CPUs - vanilla
hardware, too. MPI isn't working out of the box.

From a user's perspective, configure should be setting the right
defaults on such a setup. But the core code seems to be working - I'm
giving it a good hammering.

The allreduces in question were doing a logical OR on one integer from
each process - it was an error check. Hence the buffers (on the
application side) were 4 bytes. There were only 4 processes involved.
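
For reference, the pattern is roughly the following (a minimal sketch in
C; the MPI_LOR/MPI_INT combination and the local_err/global_err names
are illustrative, not taken from our actual code):

    /* Each rank contributes one integer error flag; a logical OR tells
     * every rank whether any process reported an error. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, local_err = 0, global_err = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... work that may set local_err to a nonzero value ... */

        /* One 4-byte int per rank, combined with a logical OR. */
        MPI_Allreduce(&local_err, &global_err, 1, MPI_INT, MPI_LOR,
                      MPI_COMM_WORLD);

        if (global_err)
            printf("rank %d: some process reported an error\n", rank);

        MPI_Finalize();
        return 0;
    }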

        - Bryan

-- 
Bryan Lally, lally_at_[hidden]
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico