On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:
>>>>>> "Ralph" == Ralph Castain <rhc_at_[hidden]> writes:
> Ralph> I'm not sure why the group communicator would make a
> Ralph> difference - the code area in question knows nothing about
> Ralph> the mpi aspects of the job. It looks like you are hitting a
> Ralph> race condition that causes a particular internal recv to
> Ralph> not exist when we subsequently try to cancel it, which
> Ralph> generates that error message. How did you configure OMPI?
> Thank you for the reply!
> It must be some race problem, but I have no control over it, or do I?
Not really. What I don't understand is why your code would work fine when using MPI_COMM_WORLD but hit a race condition when using group communicators. There shouldn't be any timing difference between the two cases.
> These are the configure options that Gentoo compiles openmpi-1.4.2 with:
> ./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --libdir=/usr/lib64 --sysconfdir=/etc/openmpi --without-xgrid --enable-pretty-print-stacktrace --enable-orterun-prefix-by-default --without-slurm --enable-contrib-no-build=vt --enable-mpi-cxx --disable-io-romio --disable-heterogeneous --without-tm --enable-ipv6
This looks okay.
I'll have to take a look and see if I can spot something in the code...
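
For anyone following the thread, the kind of race described above (an internal recv that no longer exists by the time we try to cancel it) can be illustrated with a toy model. This is only a sketch and is not Open MPI code: the RecvTable class, its post/complete/cancel methods, the run helper, and the tag value 7 are all hypothetical names made up for illustration. It simply shows that the same two events give an error or not depending on which one happens first.

```python
class RecvTable:
    """Toy registry of posted receives, keyed by tag.

    Hypothetical illustration only; this is not Open MPI's internal API.
    """

    def __init__(self):
        self._posted = set()

    def post(self, tag):
        # Register a receive so it can later be completed or cancelled.
        self._posted.add(tag)

    def complete(self, tag):
        # A matching message arrived: the receive is consumed and removed.
        self._posted.discard(tag)

    def cancel(self, tag):
        # Cancelling a receive that is no longer posted is reported as an error.
        if tag not in self._posted:
            return "error: recv %d not found" % tag
        self._posted.remove(tag)
        return "cancelled"


def run(order):
    """Post one receive, apply the steps in the given order, return the cancel result."""
    table = RecvTable()
    table.post(7)
    result = None
    for step in order:
        if step == "complete":
            table.complete(7)
        elif step == "cancel":
            result = table.cancel(7)
    return result


# Cancel happens first: the receive is still posted, so the cancel succeeds.
ok = run(["cancel", "complete"])
# Completion happens first: the cancel finds nothing left to remove.
bad = run(["complete", "cancel"])
print(ok)
print(bad)
```

In the real runtime the two steps would be driven by different events rather than a fixed list, which is why the failure shows up only intermittently.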