On Fri, 9 Mar 2012, Jeffrey Squyres wrote:
> On Mar 9, 2012, at 1:14 PM, George Bosilca wrote:
>>> The hang occurs because there is nothing on the lru to deregister and ibv_reg_mr (or GNI_MemRegister in the uGNI case) fails. The PML then puts the request on its rdma pending list and continues. If any message comes in the rdma pending list is progressed. If not it hangs indefinitely!
>> Unlike Jeff, I'm not in favor of adding bandages. If the cause is understood, then there _is_ a fix, and that fix should be the target of any efforts.
> The fix that Nathan proposes is not a complete fix -- we can still run out of memory and hang. You should read the open tickets and prior emails we have sent about this -- Nathan's fix merely delays when we will run out of registered memory. It does not solve the underlying problem.
>>> In general I have found the underlying cause of the hang is due to an imbalance of registrations between processes on a node. i.e the hung process has an empty lru but other processes could deregister. I am working on a new mpool (grdma) to handle the imbalance. The new mpool will allow a process to request that one of its peers deregisters from it lru if possible. I have a working proof of concept implementation that uses a posix shmem segment and a progress function to handle signaling and dereferencing. With it I no longer see hangs with IMB Alltoall/Alltoallv on uGNI (without putting an artificial limit on the number of registrations). I will test the mpool on infiniband later today.
>> If a solution already exists I don't see why we have to have the message code. Based on its urgency, I'm confident your patch will make its way into the 1.5 quite easily.
> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, and this is not a regression). Keep in mind that the problem has been around for *a long, long time*, which is why I approved the diag message (i.e., because a real solution is still nowhere in sight). The real issue is that we can still run out of registered memory *and there is nothing left to deregister*. The real solution there is that the PML should fall back to a different protocol, but I'm told that doesn't happen and will require a bunch of work to make work properly.
An mpool that is aware of local processes lru's will solve the problem in most cases (all that I have seen) but yes, we need to rework the pml to handle the remaining cases. There are two things that need to be changed (from what I can tell):
1) allow rget to fallback to send/put depending on the failure (I have fallback on put implemented in my branch-- and in my btl).
2) need to devise new criteria on when we should progress the rdma_pending list to avoid deadlock.
#1 is fairly simple and I haven't given much though to #2.