Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r23936
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-10-25 20:22:21


I dug into this a bit.

The hang is in the SM BTL init, where it waits for all of the peers to set seg_inited in shared memory (so that it knows everyone has reached that point). We loop calling opal_progress() while waiting.

The catch is that opal_progress() never returns (!).
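
To make the failure mode concrete, here is a minimal sketch of that kind of wait loop. It is not the actual SM BTL code: the struct layout, wait_for_peers(), and fake_peer_arrives() are hypothetical names, and it assumes seg_inited is a counter that each peer bumps once.

```c
#include <stdatomic.h>

/* Hypothetical shared-segment header; the real SM BTL layout differs. */
struct shared_seg {
    atomic_int seg_inited;   /* assumed: count of peers that reached init */
};

/* Announce our own arrival, then spin until n_peers have checked in,
 * calling the progress function between checks.
 * Returns how many times progress was called. */
static int wait_for_peers(struct shared_seg *seg, int n_peers,
                          void (*progress)(void *), void *ctx)
{
    int calls = 0;

    atomic_fetch_add(&seg->seg_inited, 1);
    while (atomic_load(&seg->seg_inited) < n_peers) {
        progress(ctx);   /* if this call blocks, the counter is never re-checked */
        ++calls;
    }
    return calls;
}

/* Stand-in progress function for illustration only: pretends that one
 * more peer arrives on every call. */
static void fake_peer_arrives(void *ctx)
{
    struct shared_seg *seg = ctx;
    atomic_fetch_add(&seg->seg_inited, 1);
}
```

The loop only terminates if the progress call comes back so the counter can be read again, which is why a blocking progress call shows up as a deadlock here even when every peer has already checked in.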

It appears that libevent's poll_dispatch() function is somehow getting an infinite timeout -- it *looks* like libevent is determining that there are no timers active, so it decides to set an infinite timeout (i.e., block) when it calls poll(). Specifically, event.c:1524 calls timeout_next(), which sees that there are no timer events active and resets tv_p to NULL. We then call the underlying fd-checking backend with an infinite timeout.
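
A rough sketch of why a NULL timeout pointer ends up as a blocking poll(). This is not libevent's code: next_timeout() and timeout_to_poll_ms() are hypothetical helpers written for illustration. The only hard fact relied on is that poll(2) blocks indefinitely when given a negative timeout.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-in for the decision described above: with no timer
 * events pending, the timeout pointer is reset to NULL. */
static const struct timeval *next_timeout(int timers_pending,
                                          const struct timeval *earliest)
{
    return timers_pending ? earliest : NULL;
}

/* Hypothetical helper: convert an optional timeval into the millisecond
 * argument that poll(2) expects. */
static int timeout_to_poll_ms(const struct timeval *tv)
{
    if (tv == NULL) {
        return -1;   /* negative timeout: poll() blocks until an fd is ready */
    }
    /* Round microseconds up so a short timeout does not collapse to 0 ms. */
    return (int)(tv->tv_sec * 1000 + (tv->tv_usec + 999) / 1000);
}
```

If the caller expects the dispatch to be non-blocking, it would need to pass a zero timeval rather than NULL. Whether the new libevent still offers an equivalent of that for our usage is exactly the open question.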

Bonk.

Does anyone more familiar with libevent's internals know why this is happening, or whether this is a change from the old version?

On Oct 25, 2010, at 6:07 PM, Jeff Squyres wrote:

> On Oct 25, 2010, at 3:21 PM, George Bosilca wrote:
>
>> So now we're in good shape, at least for compiling. IB and TCP seem to work, but SM deadlocks.
>
> Ugh.
>
> Are you debugging this, or are we? (i.e., me/Ralph)

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/