Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Deadlock with comm_create since cid allocator change
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-09-21 08:07:42


You were faster to fix the bug than I was to send my bug report :-)

So I confirm: this fixes the problem.

Thanks!
Sylvain

On Mon, 21 Sep 2009, Edgar Gabriel wrote:

> What version of Open MPI did you use? Patch #21970 should have fixed this
> issue on the trunk...
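>
> (For reference, running something like "ompi_info | grep 'Open MPI'" against
> the installation in question should show which version/revision is actually
> in use.)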
>
> Thanks
> Edgar
>
> Sylvain Jeaugey wrote:
>> Hi list,
>>
>> We are currently experiencing deadlocks when using communicators other than
>> MPI_COMM_WORLD, so we made a very simple reproducer (MPI_Comm_create followed
>> by an MPI_Barrier on the new communicator - see the end of this e-mail).
>>
>> We can reproduce the deadlock only with openib and with at least 8 cores (no
>> success with sm), and it takes roughly 20 runs on average to hit it. Using a
>> larger number of cores greatly increases the occurrence of the deadlock. When
>> the deadlock occurs, every even-ranked process is stuck in MPI_Finalize and
>> every odd-ranked process is stuck in MPI_Barrier.
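>>
>> For completeness, a typical invocation to force the openib BTL for this kind
>> of run would be something like "mpirun -np 8 --mca btl openib,self
>> ./comm_create_test" (the binary name here is only illustrative).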
>>
>> So we tracked the bug down through the changesets and found that this patch
>> seems to have introduced it:
>>
>> user: brbarret
>> date: Tue Aug 25 15:13:31 2009 +0000
>> summary: Per discussion in ticket #2009, temporarily disable the block
>> CID allocation
>> algorithms until they properly reuse CIDs.
>>
>> Reverting to the non-multi-threaded CID allocator makes the deadlock
>> disappear.
>>
>> I tried to dig further and understand why this makes a difference, with no
>> luck.
>>
>> If anyone can figure out what's happening, that would be great ...
>>
>> Thanks,
>> Sylvain
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char **argv) {
>>     int rank, numTasks;
>>     int range[3];
>>     MPI_Comm testComm, dupComm; /* dupComm is unused in this reproducer */
>>     MPI_Group orig_group, new_group;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
>>     MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
>>
>>     /* build a group containing every rank of MPI_COMM_WORLD */
>>     range[0] = 0;            /* first rank */
>>     range[1] = numTasks - 1; /* last rank */
>>     range[2] = 1;            /* stride */
>>     MPI_Group_range_incl(orig_group, 1, &range, &new_group);
>>
>>     /* create a communicator over that group and synchronize on it */
>>     MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);
>>     MPI_Barrier(testComm);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
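>>
>> (To build and run the reproducer, something along the lines of "mpicc
>> comm_create_test.c -o comm_create_test && mpirun -np 8 ./comm_create_test"
>> should work; the file name is again just illustrative.)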
>>
>
> --
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab http://pstl.cs.uh.edu
> Department of Computer Science University of Houston
> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335