Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI deadlocks and race conditions ?
From: François PELLEGRINI (francois.pellegrini_at_[hidden])
Date: 2009-05-15 03:19:49


Hello Eugene,

users-request_at_[hidden] wrote:
> Date: Thu, 14 May 2009 17:06:07 -0700
> From: Eugene Loh <Eugene.Loh_at_[hidden]>
> Subject: Re: [OMPI users] OpenMPI deadlocks and race conditions ?
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <4A0CB1EF.5050403_at_[hidden]>
> Content-Type: text/plain; format=flowed; charset=ISO-8859-1
>
> François PELLEGRINI wrote:
>
>> I sometimes run into deadlocks in OpenMPI (1.3.3a1r21206), when
>> running my MPI+threaded PT-Scotch software.
>>
> So, are there multiple threads per process that perform message-passing
> operations?

Yes. I use the MPI_THREAD_MULTIPLE thread support level.

In some parts of the code, two threads can perform
simultaneous point-to-point and collective communication.
When they do so, it is on duplicated or split communicators,
not on the same one.

> Other comments below.

Thanks for the analysis. So, to summarize, you think that,
for the part you reported on, valgrind (helgrind) is wrong:
these concurrent accesses to shared data happen only after
a software lock has been taken, so no two communication
routines can write to the same place at the same time.

However, I still wonder about the deadlocks I get. Maybe
there is still a bug in my code, and I update data structures
that are used by another communicating thread, but I guess
helgrind would have noticed that. As I reported, what puzzles
me most is that a barrier on one thread is completed by a
waitall on another thread, on another (duplicated) communicator.

When communicators are duplicated or split, do they still share
some low-level data? If so, maybe this is where the problem lies,
if this data is not fully protected.

Thanks for the help,

                                f.p.