Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] mpi_comm_spawn have problems with group communicators
From: Milan Hodoscek (milan_at_[hidden])
Date: 2010-10-04 13:48:23

>>>>> "Ralph" == Ralph Castain <rhc_at_[hidden]> writes:

    Ralph> On Oct 4, 2010, at 10:36 AM, Milan Hodoscek wrote:

>>>>>>> "Ralph" == Ralph Castain <rhc_at_[hidden]> writes:
    Ralph> I'm not sure why the group communicator would make a
    Ralph> difference - the code area in question knows nothing about
    Ralph> the mpi aspects of the job. It looks like you are hitting a
    Ralph> race condition that causes a particular internal recv to
    Ralph> not exist when we subsequently try to cancel it, which
    Ralph> generates that error message. How did you configure OMPI?
>> Thank you for the reply!
>> Must be some race problem, but I have no control of it, or do
>> I?

    Ralph> Not really. What I don't understand is why your code would
    Ralph> work fine when using comm_world, but encounter a race
    Ralph> condition when using comm groups. There shouldn't be any
    Ralph> timing difference between the two cases.

Fixing a race condition is sometimes as easy as putting some variables
into arrays. I just did that for one of them, but it didn't help. I'll
do some more testing in this direction, but I am running out of ideas.
When you set ngrp=1 and uncomment the other mpi_comm_spawn line in the
program, you basically get only one spawn, so there is no opportunity
for a race condition. In my real project I usually work with many
spawn calls, all using mpi_comm_world but running different programs,
etc., and that always works. This time I want to localize the
mpi_comm_spawn calls with a trick similar to the one in the program I
sent, so this small test case is a good model of what I would like to
have.
I studied the MPI-2 standard and I think I got it right, but one never
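
A minimal sketch of the grouping idea described above, written in plain
Python without MPI so that it runs anywhere. The test program itself is
not part of this message, so the block partition rule and the function
names group_color and partition are illustrative assumptions, not the
original code. In a real MPI program the color would typically be passed
to MPI_Comm_split, and each resulting communicator would then make its
own collective MPI_Comm_spawn call.

```python
# Hypothetical model of splitting world ranks into ngrp groups.
# Nothing here calls MPI; it only shows which ranks would share a
# group communicator under an assumed block partition.

def group_color(rank, size, ngrp):
    """Return the group index (color) for a world rank.

    Assumes a block partition: contiguous ranks land in the same group.
    """
    return rank * ngrp // size


def partition(size, ngrp):
    """Map each color to the ordered list of world ranks in that group."""
    groups = {}
    for rank in range(size):
        groups.setdefault(group_color(rank, size, ngrp), []).append(rank)
    return groups


if __name__ == "__main__":
    # With ngrp groups there would be ngrp independent spawn calls,
    # one per group communicator; ngrp=1 degenerates to a single spawn
    # over all ranks, as with mpi_comm_world.
    for color, ranks in partition(8, 2).items():
        print(f"group {color}: world ranks {ranks}, spawn root {ranks[0]}")
```

With ngrp=1 every rank gets color 0, which matches the single-spawn
case that works; with ngrp greater than 1 several groups issue spawn
calls independently, which is the situation where the error appears.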

    Ralph> I'll have to take a look and see if I can spot something in
    Ralph> the code...

Thanks a lot -- Milan