Open MPI Development Mailing List Archives

From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-05-18 05:27:26

On Thu, May 17, 2007 at 02:57:22PM -0400, Patrick Geoffray wrote:
> gshipman wrote:
> >> The fork() problem is due to memory registration aggravated by
> >> registration cache. Memory registration in itself is a hack from
> >> the OS point of view, and you already know a lot about the various
> >> problems related to registration cache.
> >>
> > So Gleb is indicating that this is a problem in the pipeline protocol
> > which does not use a registration cache. I think the registration
> > cache, while increasing the probability of badness after fork, is not
> > the culprit.
> Indeed, it makes things worse by extending the vulnerability outside the
> time frame of an asynchronous communication. Without the registration
> cache, the bad case is limited to a process that forks while a com is
> pending and touches the same pages before they are read/written by the
> hardware. This is not very likely because the window of time is very
> small, but still possible. However, it is not limited to the last
> partial page of the buffer, it can happen for any pinned page.
Now I see that you don't fully understand all of the IB ugliness, so let me
explain. In IB, QPs and CQs also use registered memory that is read and
written directly by the hardware (to signal a completion or to fetch the
next work request). After fork() the parent of course continues to use IB
and most definitely touches QP/CQ memory, and at that very moment everything
breaks. To overcome this problem (and to allow an IB program to fork() at
all), a new madvise flag was implemented that lets userspace mark certain
memory so that it is not copied to a child process. This memory is not
mapped in the child at all; not even a VMA is created for it. In the parent
this memory is not marked COW. All memory that is registered by IB is marked
in this way. So the problem is that if a non-aligned buffer is passed to MPI,
it may share a page with some data that the child wants to use, but that
data will not be present in the child.
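
To make the last point concrete, here is a minimal Linux-only sketch. It assumes the madvise flag in question is MADV_DONTFORK, and it only imitates what registration does to a page; no IB verbs are called. The function name child_can_read_neighbor and the buffer layout (application data at the start of a page, an unaligned "message buffer" 64 bytes in) are hypothetical, chosen purely for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and MADV_DONTFORK */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of Open MPI or libibverbs.
 * Returns 1 if a forked child can read data that shares a page with the
 * "registered" buffer, 0 if the child is killed by a signal when it
 * touches that data, and -1 on a setup error.
 */
int child_can_read_neighbor(int mark_dontfork)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    assert(page > 64 + 128);   /* the layout below must fit in one page */

    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    char *neighbor = p;        /* unrelated application data */
    char *msg_buf  = p + 64;   /* non-aligned buffer on the same page */
    strcpy(neighbor, "app data");
    memset(msg_buf, 0, 128);

    if (mark_dontfork) {
        /* madvise() works on whole pages, so marking msg_buf also
         * marks the neighboring data that lives on the same page. */
        if (madvise(p, (size_t)page, MADV_DONTFORK) != 0) {
            munmap(p, (size_t)page);
            return -1;
        }
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* Child: touch the neighboring data, not the message buffer.
         * The volatile read forces a real memory access. */
        volatile char *v = neighbor;
        _exit(v[0] == 'a' ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    munmap(p, (size_t)page);

    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 1;              /* child saw the data */
    if (WIFSIGNALED(status))
        return 0;              /* page was absent in the child */
    return -1;
}
```

On a Linux kernel that supports MADV_DONTFORK, child_can_read_neighbor(0) should return 1 and child_can_read_neighbor(1) should return 0: the child never touches the message buffer itself, yet it loses the data that happens to sit on the same page.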