There are fundamental differences between the two event engines. Just as an example, the new one allows us to have multiple pools of sockets, making it a lot easier to implement optimized asynchronous operations using multiple threads. If we add the old engine back, we will either have to implement these features there, or find the lowest common denominator between the two engines and use it in the rest of the code (ORTE, OMPI layers). This sounds like a regression to me.
On Oct 26, 2010, at 10:37, Joshua Hursey wrote:
> I like the idea of putting the old libevent back as a separate component, just for performance/correctness comparisons. I think it would be good for the trunk, but for the release branches just choose one version to ship (so we don't confuse users).
> -- Josh
> On Oct 26, 2010, at 6:27 AM, Jeff Squyres (jsquyres) wrote:
>> Btw it strikes me that we could put the old libevent back as a separate component for comparisons.
>> Sent from my PDA. No type good.
>> On Oct 26, 2010, at 6:20 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>> On Oct 25, 2010, at 9:29 PM, George Bosilca wrote:
>>>> 1. Not all processes deadlock in btl_sm_add_procs. The process that set up the shared memory area goes forward and blocks later in a barrier.
>>> Yes, I'm seeing the same thing (I didn't include all details like this in my post, sorry). I was running with -np 2 on a local machine and saw vpid=0 get stuck in opal_progress (because the first time through, seg_inited < n_local_procs). vpid=1 increments seg_inited and therefore doesn't enter the loop that calls opal_progress(), and therefore continues on.
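To make the wait Jeff describes concrete, here is a minimal sketch of that kind of loop. It is not the actual btl_sm code: `wait_for_local_procs`, `progress_fn` and `peer_arrives` are hypothetical names, and the progress callback only stands in for opal_progress(). The point it illustrates is that the counter is re-checked only after the progress call returns, so a progress call that blocks forever hangs the caller even if the counter has already reached its target.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for opal_progress(); not an Open MPI symbol. */
typedef void (*progress_fn)(void *ctx);

/* Announce ourselves, then spin on progress until every local process
 * has checked in. Returns how many times progress had to be called. */
int wait_for_local_procs(atomic_int *seg_inited, int n_local_procs,
                         progress_fn progress, void *ctx)
{
    int calls = 0;

    atomic_fetch_add(seg_inited, 1);          /* "I have arrived" */

    /* A late arriver sees the counter already at n_local_procs and
     * skips the loop; an early arriver has to wait here. */
    while (atomic_load(seg_inited) < n_local_procs) {
        progress(ctx);   /* if this never returns, we never re-check */
        ++calls;
    }
    return calls;
}

/* Test helper (assumption): simulates one peer arriving during a
 * progress call by bumping the shared counter. */
void peer_arrives(void *ctx)
{
    atomic_fetch_add((atomic_int *)ctx, 1);
}
```

With a counter starting at 0 and n_local_procs = 2, the first caller has to go through one progress call before the simulated peer shows up; a second caller arriving afterwards would skip the loop entirely, which matches the vpid=0 / vpid=1 behaviour described above.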
>>>> 2. All other processes loop around opal_progress until they have received a message from all other processes. The variable used for counting is somehow updated correctly, but we still call opal_progress. I couldn't figure out if we loop more than we should, or if opal_progress doesn't return. However, both of these possibilities look very unlikely to me: the loop in sm_add_procs is pretty straightforward, and I couldn't find any loops in opal_progress. I wonder if some of the messages get lost in the exchange.
>>> I had this problem, too, until I tried to use padb to get stack traces. I noticed that when I ran padb, my blocked process un-blocked itself and continued. After more digging, I determined that my blocked process was, in fact, blocked in poll() with an infinite timeout. padb (or any signal at all) caused it to unblock and therefore continue.
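The "any signal unblocks it" behaviour can be reproduced outside Open MPI with plain POSIX calls. The sketch below is only an illustration under stated assumptions: `poll_until_signal` and `on_alarm` are hypothetical names, the pipe stands in for a descriptor that never becomes ready, and alarm() stands in for whatever signal padb delivered. It shows poll() with a timeout of -1 returning with EINTR once a handled signal arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to make the signal interrupt poll(). */
void on_alarm(int sig) { (void)sig; }

/* Block in poll() with an infinite timeout on a pipe that nobody
 * writes to. Returns the errno seen when poll() comes back, 0 if
 * poll() did not fail, or -1 if the pipe could not be created. */
int poll_until_signal(void)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    alarm(1);                        /* assumption: stands in for padb */

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int rc = poll(&pfd, 1, -1);      /* -1 means wait forever */
    int err = (rc < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Calling poll_until_signal() blocks for about a second and then returns EINTR. Without the alarm() call it would sit in poll() indefinitely, which is the kind of hang described above.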
>>>> 3. If I unblock the situation by hand, everything goes back to normal. NetPIPE runs to completion but the performance is __really__ bad. On my test machine I get around 2000 Mb/s, when the expected value is at least 10 times more. Similar finding on the latency side: we're now at 1.65 microseconds, up from the usual 0.35 we had before.
>>> It's a feature!
>>> Jeff Squyres
>>> devel mailing list
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory