Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_AllReduce() deadlock on IB
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-03-16 20:38:06

This could be related to and/or

There isn't much info in the ticket, but we've been talking about it a bunch offline. IBM and Mellanox have had reports of the error, but haven't been able to reproduce it reliably. It *seems* to be a race condition in the "oob" connection model of the openib BTL.

If you run with --mca btl_openib_cpc_include rdmacm, does the problem go away?
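For reference, a minimal sketch of what that invocation might look like. The process count and binary name here are hypothetical placeholders, not taken from the original report; only the --mca btl_openib_cpc_include rdmacm setting comes from the suggestion above.

```shell
# Hypothetical process count and binary name; substitute your own.
NP=16
APP=./my_app

# Ask the openib BTL to use only the rdmacm connection manager,
# so the "oob" connection model is not selected.
CMD="mpirun --mca btl_openib_cpc_include rdmacm -np $NP $APP"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

If the hang no longer appears with this setting, that would be consistent with the suspected race in the "oob" connection setup, though it would not prove it.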

On Mar 16, 2011, at 11:27 AM, Brock Palen wrote:

> I have a user whose code performs fine when run over Ethernet. When run over verbs-based IB, the code deadlocks in an MPI_AllReduce() call.
> We are using openmpi/1.4.3 with the Intel compilers.
> I poked at the running code with padb and I get the following:
> 0....5....1....5....2....5....3....5....4....5....
> ,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
> ,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
> ,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
> --,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
> Across multiple runs, the set of ranks stuck in AllReduce() changes.
> Are there any open bugs? I found one, but it applies only to shared memory, and our version should be new enough (from what I could tell) to avoid it.
> Thanks, what should I look for to diagnose the issue?
> Brock Palen
> Center for Advanced Computing
> brockp_at_[hidden]
> (734)936-1985

Jeff Squyres