A strange race condition happening for undisclosed reasons, and only fixable by replication is jeopardizing our reference count system. That sounds definitively almost scary (!)
I think that the proposed solution is just a band-aid. It somehow fixes this particular instance of the issue but leave all the others unpatched, asking for troubles later on. This problem has been lingering around for years, but we failed to address it correctly up to now.
Based on my understanding of the code the problem is not with the ref count but with the way opal_buffer_t is handled. We have no way to retrieve the pointer where the data in the opal_buffer_t is stored without a destructive operation. This means every time we need to have the pointer of the opal_buffer_t (like in the send operation to build the iovecs), we have to do a load followed by an unload, leaving the opal_buffer_t uninitialized for a short amount of time. As a result it is completely unsafe to use the same opal_buffer_t concurrently for multiple operations, as some callbacks can find the buffer uninitialized when they fire.
Now regarding the patch itself, I have to congratulate the Open MPI community for its unbelievable response time. A solution proposed, then tested on the faulty platforms, then the code carefully reviewed and finally pushed in a stable branch all in a mere 43 minutes (!). It shows that all the protection mechanism we put in place around our stable branches are entirely functional and their role is completely fulfilled. I doubt any other open source project can claim such a feat. Congratulations!
commit in the trunk @ Timestamp: 08/28/12 13:17:34 (6 hours ago)
commit in the 1.7 @ Timestamp: 08/28/12 14:00:10 (5 hours ago)
On Aug 28, 2012, at 19:17 , svn-commit-mailer_at_[hidden] wrote:
> Author: rhc (Ralph Castain)
> Date: 2012-08-28 13:17:34 EDT (Tue, 28 Aug 2012)
> New Revision: 27161
> URL: https://svn.open-mpi.org/trac/ompi/changeset/27161
> Fix a strange race condition by creating a separate buffer for each send - apparently, just a retain isn't enough protection on some systems