>>>>> "Ralph" == Ralph Castain <rhc_at_[hidden]> writes:
Ralph> On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:
>>>>>>> "Ralph" == Ralph Castain <rhc_at_[hidden]> writes:
>>
Ralph> I'm not sure why the group communicator would make a
Ralph> difference - the code area in question knows nothing about
Ralph> the mpi aspects of the job. It looks like you are hitting a
Ralph> race condition that causes a particular internal recv to
Ralph> not exist when we subsequently try to cancel it, which
Ralph> generates that error message. How did you configure OMPI?
>>
>> Thank you for the reply!
>>
>> Must be some race problem, but I have no control of it, or do
>> I?
Ralph> Not really. What I don't understand is why your code would
Ralph> work fine when using comm_world, but encounter a race
Ralph> condition when using comm groups. There shouldn't be any
Ralph> timing difference between the two cases.
Fixing a race condition is sometimes as easy as putting some variables
into arrays. I just did that for one of them, but it didn't help. I'll
do some more testing in this direction, but I am running out of ideas.
When you set ngrp=1 and uncomment the other mpi_comm_spawn line in the
program, you basically get only one spawn, so there is no opportunity
for a race condition. In my real project I usually work with many
spawn calls, all using mpi_comm_world but running different programs,
etc., and that always works. This time I want to localize the
mpi_comm_spawn calls with a trick similar to the one in the program I
sent, so this small test case is a good model of what I would like to
have.
I studied the MPI-2 standard and I think I got it right, but one never
knows...
Ralph> I'll have to take a look and see if I can spot something in
Ralph> the code...
Thanks a lot -- Milan