Hello,
I seem to have encountered a bug in Open MPI 1.0 using indexed datatypes
with MPI_Recv (apparently of the "off by one" sort). I have
attached a test case, which is briefly explained below (as well as in the
source file). This case should be run on two processes. I observed the bug
on 2 different Linux systems (single processor Centrino under Suse 10.0
with gcc 4.0.2, dual-processor Xeon under Debian Sarge with gcc 3.4)
with Open MPI 1.0.1, and do not observe it using LAM 7.1.1 or MPICH2.
Here is a summary of the case:
------------------
Each processor reads a file ("data_p0" or "data_p1") giving a list of
global element ids. Some elements (vertices from a partitioned mesh)
may belong to both processors, so their ids may appear on both
processors: we have 7178 global vertices, 3654 and 3688 of them being
known by ranks 0 and 1 respectively.
In this simplified version, we assign coordinates {x, y, z} to each
vertex equal to its global id number for rank 1, and the negative of
that for rank 0 (assigning the same values to x, y, and z). After
finishing the "ordered gather", rank 0 prints the global id and
coordinates of each vertex.
Lines should print (for example) as:
6456 ; 6456.00000 6456.00000 6456.00000
6457 ; -6457.00000 -6457.00000 -6457.00000
depending on whether a vertex belongs only to rank 0 (negative
coordinates) or belongs to rank 1 (positive coordinates).
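
The expected values can be sketched in plain C, without MPI. This is not the attached gather_test.c: the names fill_coords and scatter_to_global are hypothetical, and the sketch assumes 1-based global ids and that rank 1's values are written last, so that they take precedence on shared vertices.

```c
#include <assert.h>
#include <stddef.h>

/* Set x, y and z of each local vertex to sign * global id
   (sign = -1.0 for rank 0, +1.0 for rank 1, as in the test case). */
void fill_coords(const int *ids, size_t n, double sign, double *xyz)
{
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 3; k++)
            xyz[3 * i + k] = sign * (double)ids[i];
}

/* Copy local coordinates into the global array.
   Assumption: global ids are 1-based, so id lands in row id - 1. */
void scatter_to_global(const int *ids, size_t n, const double *xyz,
                       size_t n_glob, double *glob)
{
    for (size_t i = 0; i < n; i++) {
        assert(ids[i] >= 1 && (size_t)ids[i] <= n_glob);
        for (int k = 0; k < 3; k++)
            glob[3 * (size_t)(ids[i] - 1) + k] = xyz[3 * i + k];
    }
}
```

Scattering rank 0's values first and rank 1's values second gives negative coordinates only for vertices known to rank 0 alone.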
With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on
Debian Sarge with gcc 3.4), we get, for example, for the last vertices:
7176 ; 7175.00000 7175.00000 7176.00000
7177 ; 7176.00000 7176.00000 7177.00000
This seems to indicate an "off by one" type bug in datatype handling.
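
A simple consistency check on the gathered result makes such rows easy to spot. The sketch below is only an illustration (the function name first_bad_row is hypothetical and not part of the attached test case); it relies on the fact that, in this test, x, y and z of a vertex must all equal either its global id or the negative of it.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first row whose three coordinates are not
   all equal to +id or -id, or n if every row is consistent. */
size_t first_bad_row(const int *ids, const double *xyz, size_t n)
{
    assert(ids != NULL && xyz != NULL);
    for (size_t i = 0; i < n; i++) {
        double x = xyz[3 * i];
        double abs_x = (x < 0.0) ? -x : x;
        int same = (xyz[3 * i + 1] == x) && (xyz[3 * i + 2] == x);
        if (!same || abs_x != (double)ids[i])
            return i;
    }
    return n;
}
```

Exact floating-point comparison is acceptable here because the coordinates are small integers stored as doubles.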
When not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE
in the gather_test.c file), the bug disappears. Using the indexed
datatype with LAM MPI 7.1.1 or MPICH2, we do not reproduce the bug
either, so it does seem to be an Open MPI issue.
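
For reference, MPI_Type_indexed describes a layout as a list of blocks, each with a length and a displacement counted in units of the old datatype. A plain-C emulation of where a contiguous message should land in the receive buffer under such a layout is sketched below. The function name unpack_indexed is hypothetical, the element type is assumed to be double, and the actual layout built in gather_test.c may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Emulate receiving into an indexed layout: block b takes the next
   blocklengths[b] values of the packed message and stores them in
   base starting at element displacements[b]. */
void unpack_indexed(const double *packed, int count,
                    const int *blocklengths, const int *displacements,
                    double *base)
{
    size_t src = 0;
    assert(count >= 0);
    for (int b = 0; b < count; b++)
        for (int k = 0; k < blocklengths[b]; k++)
            base[displacements[b] + k] = packed[src++];
}
```

With one block of three doubles per vertex, a displacement that is off by one element would shift a vertex's coordinates by one slot in the receive buffer.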
------------------
Best regards,
Yvan Fournier