On Nov 14, 2013, at 1:16 PM, Shamis, Pavel <shamisp_at_[hidden]> wrote:
>> 1. Ralph made the OOB asynchronous.
I pondered this for a while today, and I just want to correct any misimpression this statement might leave, especially with folks who haven't been around the project much over the last couple of years. Just to clarify: this wasn't a case of Ralph waking up one day and saying "hey, let's make the OOB async!". Quite the contrary.
This whole conversion process started nearly two years ago when we, as a community, decided to move towards an async progress model. We laid out all the things that we thought would need to be done to make that happen...and then we started down that path. First, we updated the event library to the 2.x series so we could separate the event bases for the different layers, and so we could have event priority levels. Some folks started hardening the BTLs for thread safety and adding progress threads inside them. Etc.
One step on that path was to make ORTE operate asynchronously as a purely event-driven library. We began by rewriting the state machine so that all ORTE operations ran in an event, except for the OOB, as that can of worms was just too hard. Frankly, nobody wanted to touch it, so we left it alone and made everything else work.
Finally, I took on the OOB rewrite. One of our continual problems was deadlocking because someone would call a blocking send/recv while in an OOB callback - usually way down in the stack, somewhere that wasn't immediately obvious to the user. After spending time fiddling with things, it became clear that the only simple solution was to make the OOB totally non-blocking. This also made for a much cleaner integration with the rest of the ORTE state machine.
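For anyone who hasn't hit this deadlock, here is a minimal Python sketch of the problem. It is not ORTE code: ToyEventLoop, recv_nb, and recv_blocking are hypothetical names made up for the illustration, and the toy loop simply reports the deadlock where a real blocking call would hang forever. The point is only that a callback running inside a single-threaded event loop cannot wait for another event, because nothing else advances the loop while it waits.

```python
from collections import deque


class ToyEventLoop:
    """Hypothetical single-threaded event loop: one callback runs at a time."""

    def __init__(self):
        self.pending = deque()    # messages queued but not yet processed
        self.handlers = {}        # tag -> callback posted via recv_nb
        self.in_callback = False  # True while a callback is executing

    def deliver(self, tag, payload):
        """Queue an incoming message; it is processed only when the loop runs."""
        self.pending.append((tag, payload))

    def recv_nb(self, tag, callback):
        """Non-blocking receive: register a callback and return immediately."""
        self.handlers[tag] = callback

    def recv_blocking(self, tag):
        """Blocking receive: spin the loop until the tagged message arrives.

        Called from inside a callback, the loop cannot advance, so a real
        implementation would hang. The toy raises instead of hanging.
        """
        if self.in_callback:
            raise RuntimeError("deadlock: blocking recv inside a callback")
        result = []
        self.recv_nb(tag, result.append)
        while not result:
            if not self.run_once():
                raise RuntimeError("no message will ever arrive")
        return result[0]

    def run_once(self):
        """Process one queued message; return False if the queue is empty."""
        if not self.pending:
            return False
        tag, payload = self.pending.popleft()
        # Messages with no registered handler are dropped in this toy.
        callback = self.handlers.pop(tag, None)
        if callback is not None:
            self.in_callback = True
            try:
                callback(payload)
            finally:
                self.in_callback = False
        return True

    def run(self):
        while self.run_once():
            pass


def demo_blocking():
    """A callback that issues a blocking recv: the pattern that deadlocked."""
    loop = ToyEventLoop()
    loop.deliver("hello", "h")
    loop.deliver("reply", "r")
    loop.recv_nb("hello", lambda msg: loop.recv_blocking("reply"))
    try:
        loop.run()
    except RuntimeError as err:
        return str(err)
    return "ok"


def demo_nonblocking():
    """The same exchange with the follow-up receive posted as a callback."""
    loop = ToyEventLoop()
    log = []

    def on_hello(msg):
        log.append(msg)
        loop.recv_nb("reply", log.append)  # returns immediately

    loop.deliver("hello", "h")
    loop.deliver("reply", "r")
    loop.recv_nb("hello", on_hello)
    loop.run()
    return log
```

In the sketch, demo_blocking() reports the deadlock, while demo_nonblocking() completes both receives because the callback returns to the loop and lets the next event run. That is the same reasoning that led to dropping the blocking interfaces entirely rather than trying to detect every nested call.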
So we brought it up at a couple of developer meetings, talked a number of times on the weekly telecon, went through several email threads, RFCs, etc. - with me emphasizing repeatedly that the OOB was going to lose its blocking interfaces. The fact that OOB callbacks would be occurring in the ORTE event base thread was also discussed, and was one of the reasons why we locked libevent thread protection "on" earlier this year. This fact may have escaped some people, but it was discussed on several occasions.
The proof of the pudding is that all of the MPI layer has been adapted to the new async behavior -except- for the openib CPCs. The issue of what to do with these has been raised several times, especially once the ofacm code was committed. Unfortunately, lack of time and competing priorities left this code to bitrot.
I'm not pointing fingers at anyone, nor am I saying this was all perfect. Just trying to point out that this was a community move that is part of our community roadmap, and we perhaps need to be better at finding a way to keep everyone/everything a little more connected to the convoy. This is going to get even more rocky in the next year as we push towards full thread safety and async progress, and re-implement checkpoint/restart support.