Subject: [OMPI devel] Deadlock with comm_create since cid allocator change
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-09-21 07:51:01

Hi list,

We are currently experiencing deadlocks when using communicators other
than MPI_COMM_WORLD. So we made a very simple reproducer (Comm_create then
MPI_Barrier on the communicator - see end of e-mail).

We can reproduce the deadlock only with openib and with at least 8 cores
(no success with sm), after ~20 runs on average. Using a larger number of
cores greatly increases the occurrence of the deadlock. When the deadlock
occurs, every even process is stuck in MPI_Finalize and every odd process
is in MPI_Barrier.

So we tracked the bug through the changesets and found that this patch seems
to have introduced it:

user: brbarret
date: Tue Aug 25 15:13:31 2009 +0000
summary: Per discussion in ticket #2009, temporarily disable the block CID allocation
algorithms until they properly reuse CIDs.

Reverting to the non multi-thread cid allocator makes the deadlock
disappear.

I tried to dig further and understand why this makes a difference, with no
success.

If anyone can figure out what's happening, that would be great ...


#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
     int rank, numTasks;
     int range[3];
     MPI_Comm testComm, dupComm;
     MPI_Group orig_group, new_group;

     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
     MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
     range[0] = 0; /* first rank */
     range[1] = numTasks - 1; /* last rank */
     range[2] = 1; /* stride */
     MPI_Group_range_incl(orig_group, 1, &range, &new_group);
     MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);
     MPI_Barrier(testComm); /* barrier on the new communicator */
     MPI_Finalize();
     return 0;