Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r26106
From: Shamis, Pavel (shamisp_at_[hidden])
Date: 2012-03-09 15:16:52


>> Depending on the timing, this might go to 1.6 (1.5.5 has waited for too long, and this is not a regression). Keep in mind that the problem has been around for *a long, long time*, which is why I approved the diag message (i.e., because a real solution is still nowhere in sight). The real issue is that we can still run out of registered memory *and there is nothing left to deregister*. The real solution there is that the PML should fall back to a different protocol, but I'm told that doesn't happen and will require a bunch of work to make work properly.
>
> An mpool that is aware of the local processes' LRUs will solve the problem in most cases (all that I have seen), but yes, we need to rework the PML to handle the remaining cases. There are two things that need to be changed (from what I can tell):
>
> 1) Allow rget to fall back to send/put depending on the failure (I have fallback on put implemented in my branch, and in my BTL).
> 2) Devise new criteria for when we should progress the rdma_pending list to avoid deadlock.
>
> #1 is fairly simple, and I haven't given much thought to #2.

But #1 would be a good start in the right direction. I agree about #2.
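
To make the idea in #1 concrete, here is a rough sketch of a fallback chain. This is not the PML code and not what is in Nathan's branch; every name below (xfer_status_t, xfer_with_fallback, the stub_* functions) is hypothetical and exists only for illustration. It assumes the order rget, then put, then send, and assumes that only an out-of-resource failure (for example, registration failed and there is nothing left to deregister) should trigger the fallback.

```c
#include <stddef.h>

/* Hypothetical status codes for illustration; not Open MPI's constants. */
typedef enum {
    XFER_OK = 0,
    XFER_ERR_OUT_OF_RESOURCE,  /* e.g. memory registration failed */
    XFER_ERR_FATAL             /* any other, non-recoverable failure */
} xfer_status_t;

/* Assumed preference order: rget first, then put, then plain send. */
typedef enum { PROTO_RGET = 0, PROTO_PUT, PROTO_SEND, PROTO_COUNT } xfer_proto_t;

/* One transfer attempt with a given protocol (hypothetical signature). */
typedef xfer_status_t (*xfer_fn_t)(const void *buf, size_t len);

/*
 * Try each available protocol in order. Move on to the next protocol only
 * when the failure is "out of resource"; any other error is returned as-is
 * so that a real failure is not hidden by the fallback.
 */
static xfer_status_t xfer_with_fallback(const xfer_fn_t protos[PROTO_COUNT],
                                        const void *buf, size_t len,
                                        xfer_proto_t *used)
{
    xfer_status_t rc = XFER_ERR_FATAL;  /* returned if no protocol is set */

    for (int p = PROTO_RGET; p < PROTO_COUNT; ++p) {
        if (protos[p] == NULL) {
            continue;                   /* protocol not available */
        }
        rc = protos[p](buf, len);
        if (rc == XFER_OK) {
            if (used != NULL) {
                *used = (xfer_proto_t)p;  /* report which protocol worked */
            }
            return XFER_OK;
        }
        if (rc != XFER_ERR_OUT_OF_RESOURCE) {
            return rc;                  /* hard error: do not fall back */
        }
    }
    return rc;  /* every available protocol ran out of resources */
}

/* Stubs standing in for real transfers, used only to exercise the chain. */
static xfer_status_t stub_rget_out_of_memory(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return XFER_ERR_OUT_OF_RESOURCE;
}

static xfer_status_t stub_put_ok(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return XFER_OK;
}

static xfer_status_t stub_send_ok(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return XFER_OK;
}
```

With the stubs above, a table of { stub_rget_out_of_memory, stub_put_ok, stub_send_ok } ends up using put. This sketch deliberately says nothing about #2: deciding when to progress the rdma_pending list is a separate problem.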

>
> -Nathan