I dug into this a bit.
The hang is in the SM BTL init, where each process waits for all of its peers to set seg_inited in shared memory (so that it knows everyone has reached that point). We loop calling opal_progress() while waiting.
The problem is that opal_progress() is not returning (!).
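To make the failure mode concrete, here is a minimal sketch of that kind of wait loop. It is not the actual SM BTL code: wait_for_peers(), toy_progress(), and the max_spins cap are hypothetical names made up for illustration, and the flags live in an ordinary array rather than a real shared-memory segment. The point is only that the flags get rechecked solely when the progress call returns.

```c
#include <stddef.h>

/* Hypothetical stand-in for a progress function such as opal_progress(). */
typedef void (*progress_fn)(void *ctx);

/*
 * Spin until every peer has set its flag, calling progress() between checks.
 * Returns the number of progress calls made, or -1 if max_spins is reached.
 * (max_spins exists only so this sketch always terminates.)
 */
static int wait_for_peers(volatile int *seg_inited, size_t n_peers,
                          progress_fn progress, void *ctx, int max_spins)
{
    for (int spins = 0; spins < max_spins; ++spins) {
        size_t ready = 0;
        for (size_t i = 0; i < n_peers; ++i) {
            if (seg_inited[i]) {
                ++ready;
            }
        }
        if (ready == n_peers) {
            return spins;
        }
        /* If this call blocks, the flags above are never rechecked. */
        progress(ctx);
    }
    return -1;
}

/* Toy progress function: marks one more peer as initialized per call. */
struct toy_ctx {
    volatile int *flags;
    size_t n;
    size_t next;
};

static void toy_progress(void *p)
{
    struct toy_ctx *c = p;
    if (c->next < c->n) {
        c->flags[c->next++] = 1;
    }
}

/* Runs the loop against three simulated peers. */
static int demo_wait(void)
{
    volatile int flags[3] = {0, 0, 0};
    struct toy_ctx ctx = { flags, 3, 0 };
    return wait_for_peers(flags, 3, toy_progress, &ctx, 100);
}
```

With a progress function that returns promptly, the loop ends as soon as the last flag is set. If progress() never returns, the loop is stuck regardless of what the peers write into shared memory.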
It appears that libevent's poll_dispatch() function is somehow getting an infinite timeout -- it *looks* like libevent is determining that there are no timers active, so it decides to set an infinite timeout (i.e., block) when it calls poll(). Specifically, event.c:1524 calls timeout_next(), which sees that there are no timer events active and resets tv_p to NULL. We then call the underlying fd-checking backend with an infinite timeout.
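As a rough illustration of why a NULL timeout matters, here is a small sketch. It is not libevent's code: pick_timeout(), to_poll_ms(), timeout_ms_for(), and poll_idle_pipe() are hypothetical helpers that only mimic the decision described above (no timers pending means a NULL timeval, which a poll()-based backend would turn into -1, i.e. block indefinitely).

```c
#include <poll.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the timer lookup: returns NULL ("no timeout")
 * when no timers are pending, otherwise a zeroed timeval for simplicity.
 */
static struct timeval *pick_timeout(int timers_pending, struct timeval *tv)
{
    if (!timers_pending) {
        return NULL;
    }
    tv->tv_sec = 0;
    tv->tv_usec = 0;
    return tv;
}

/* Convert to poll()'s millisecond argument; NULL maps to -1 (infinite). */
static int to_poll_ms(const struct timeval *tv)
{
    if (tv == NULL) {
        return -1;
    }
    return (int)(tv->tv_sec * 1000 + (tv->tv_usec + 999) / 1000);
}

/* Convenience wrapper combining the two steps above. */
static int timeout_ms_for(int timers_pending)
{
    struct timeval tv;
    return to_poll_ms(pick_timeout(timers_pending, &tv));
}

/*
 * Poll the read end of an idle pipe with the given timeout.
 * Returns poll()'s result, or -2 if the pipe could not be created.
 * Do not pass -1 here: nothing ever writes to the pipe, so it would hang.
 */
static int poll_idle_pipe(int timeout_ms)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -2;
    }
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, timeout_ms);
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

With a timeout of 0, poll() on the idle pipe returns 0 right away; with -1 it would wait until an fd becomes ready, which matches the hang we see if nothing ever arrives on the polled fds.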
Does anyone more familiar with libevent's internals know why this is happening, or whether this is a change since the old version?
On Oct 25, 2010, at 6:07 PM, Jeff Squyres wrote:
> On Oct 25, 2010, at 3:21 PM, George Bosilca wrote:
>> So now we're in good shape, at least for compiling. IB and TCP seem to work, but SM deadlocks.
> Are you debugging this, or are we? (i.e., me/Ralph)
> Jeff Squyres
> For corporate legal information go to: