Jeff Squyres wrote:
> (for the web archives)
>
> Brock and I talked about this .f90 code a bit off list -- he's going
> to investigate with the test author a bit more because both of us are
> a bit confused by the F90 array syntax used.
Attached is a simple send/recv code written in (procedural) C++ that
illustrates a similar problem. It dies after a random number of iterations
with openmpi-1.3.2 or 1.3.3. (I have submitted this before.) On some
machines this goes away with "-mca btl_sm_num_fifos 8" or
"-mca btl ^sm", so I think this is
https://svn.open-mpi.org/trac/ompi/ticket/2043.
Since it has barriers after each send/recv pair, I do not understand how
any buffers could fill up.
Various stats:
iter.cary$ uname -a
Linux iter.txcorp.com 2.6.29.4-167.fc11.x86_64 #1 SMP Wed May 27
17:27:08 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
iter.cary$ g++ --version
g++ (GCC) 4.4.0 20090506 (Red Hat 4.4.0-4)
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
iter.cary$ mpicxx -show
g++ -I/usr/local/openmpi-1.3.2-nodlopen/include -pthread
-L/usr/local/torque-2.4.0b1/lib -Wl,--rpath
-Wl,/usr/local/torque-2.4.0b1/lib
-Wl,-rpath,/usr/local/openmpi-1.3.2-nodlopen/lib
-L/usr/local/openmpi-1.3.2-nodlopen/lib -lmpi_cxx -lmpi -lopen-rte
-lopen-pal -ltorque -ldl -lnsl -lutil -lm
I have seen failures on 64-bit hardware only.
John Cary
>
>
>
> On Dec 1, 2009, at 10:46 AM, Brock Palen wrote:
>
>> The attached code is an example where openmpi/1.3.2 will lock up if
>> run on 48 cores over IB (4 cores per node).
>> The code loops over recv from all processors on rank 0 and sends from
>> all other ranks. As far as I know this should work, and I can't see
>> why not.
>> Note: yes, I know we can do the same thing with a gather; this is a
>> simple case to demonstrate the issue.
>> Note that if I increase the openib eager limit, the program runs,
>> which normally means improper MPI, but I can't on my own figure out
>> the problem with this code.
>>
>> Any input on why code like this locks up unless we up the eager
>> buffer would be helpful, as we should not be having to up the buffer
>> size just to make code run; it makes me feel hacky and dirty.
>>
>>
>> <sendbuf.f90><ATT9198877.txt><ATT9198879.txt>
>
>
/**
 * A simple test program to demonstrate a problem in OpenMPI 1.3
 *
 * Make with:
 *   mpicxx -o ompi1.3.3-bug ompi1.3.3-bug.cxx
 *
 * Run with:
 *   mpirun -n 3 ompi1.3.3-bug
 */
// mpi includes
#include <mpi.h>
// std includes
#include <iostream>
#include <vector>
// useful hashdefine
#define ARRAY_SIZE 250
/**
 * Main driver
 */
int main(int argc, char** argv) {
  // Initialize MPI
  MPI_Init(&argc, &argv);
  int rk, sz;
  MPI_Comm_rank(MPI_COMM_WORLD, &rk);
  MPI_Comm_size(MPI_COMM_WORLD, &sz);
  // Create some data to pass around
  std::vector<double> d(ARRAY_SIZE);
  // Initialize to some values if we aren't rank 0
  if ( rk )
    for ( unsigned i = 0; i < ARRAY_SIZE; ++i )
      d[i] = 2*i + 1;
  // Loop until this breaks
  unsigned t = 0;
  while ( 1 ) {
    MPI_Status s;
    if ( rk )
      MPI_Send( &d[0], d.size(), MPI_DOUBLE, 0, 3, MPI_COMM_WORLD );
    else
      for ( int i = 1; i < sz; ++i )
        MPI_Recv( &d[0], d.size(), MPI_DOUBLE, i, 3, MPI_COMM_WORLD, &s );
    MPI_Barrier(MPI_COMM_WORLD);
    std::cout << "Transmission " << ++t << " completed." << std::endl;
  }
  // Finalize MPI
  MPI_Finalize();
}