PLPA Users' Mailing List Archives

From: Bogdan Costescu (Bogdan.Costescu_at_[hidden])
Date: 2005-12-19 15:29:56


On Mon, 19 Dec 2005, Paul H. Hargrove wrote:

> I like it as a sanity check on the len value we've selected.

Yes, I agree with that...

> we could still make a set() call to validate length, but pass a NULL
> pointer for mask. Then we consider the validation to be a success
> if we get errno==EFAULT rather than errno==EINVAL (which works
> because len is checked first).

I just checked all the kernels that I looked at before and this would
work for all of them. Nice idea!
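
To make the idea concrete, here is a minimal sketch of the NULL-mask
check in C. It is not the PLPA code: the names len_check,
classify_errno() and probe_len() are made up for illustration, and it
assumes a Linux system where the raw sched_setaffinity syscall is
reachable through syscall(2). Whether a particular kernel really
checks len before touching the mask pointer is the assumption
discussed in this thread, so any other outcome is reported as
"unknown" rather than guessed at.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical result type for the length validation. */
enum len_check { LEN_OK, LEN_REJECTED, LEN_UNKNOWN };

/* Interpret the outcome of a set() call made with a NULL mask:
 * EFAULT means the kernel accepted len and only then tripped over
 * the bad pointer; EINVAL means it rejected the call (assumed here
 * to be because of len, as it is checked first). */
static enum len_check classify_errno(long rc, int err)
{
    if (rc == -1 && err == EFAULT)
        return LEN_OK;
    if (rc == -1 && err == EINVAL)
        return LEN_REJECTED;
    return LEN_UNKNOWN; /* unexpected result: do not guess */
}

/* Validate len without changing any affinity: the mask is NULL,
 * so the call cannot succeed for a non-zero len. */
static enum len_check probe_len(size_t len)
{
    errno = 0;
    long rc = syscall(SYS_sched_setaffinity, (long)0, len, NULL);
    int err = errno; /* save before anything else can clobber it */
    return classify_errno(rc, err);
}
```

A caller would try candidate lengths and keep the first one for which
probe_len() returns LEN_OK, treating LEN_UNKNOWN as a reason to fall
back to whatever validation is used today.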

> I want to say that I doubt that the races described are a real issue in
> practice.

I have encountered lots of such assumptions in my work - according to
the code authors, the problems were very unlikely to happen, but they
did happen...

Furthermore, as Jeff wrote in the docs, this is probably a library
that won't see much change over time. I spent some time researching
all of these and the data is still fresh in my mind, so it is much
easier for me to think about it now than it would be to do it again
in a year's time, when somebody suddenly starts to hit the race. ;-)

Anyway, even if the code remains like it is now, this discussion will
be archived, so that the affected people can find a solution.

> The potential that second thread in the same process (an "internal
> external entity") can run the entire probe to completion AND issue a
> set() for the current thread, all between the get() and set() is
> VERY small, and a mutex inside the probe function could eliminate
> this potential race entirely.

Not if they get scheduled on different CPUs with maybe different
loads; one of them might even get swapped out and back in during this
time. I actually didn't think at all about threads when I wrote about
the race...
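
For reference, the mutex suggested in the quoted text could look
roughly like the sketch below. It is only an illustration: probe_lock,
probe_once() and do_probe_unlocked() are hypothetical names, and the
body of do_probe_unlocked() is a stand-in for the real get()/set()
sequence. Note that such a lock only serializes threads that go
through probe_once(); it does nothing about another thread or an
external entity that calls set() directly.

```c
#include <pthread.h>

static pthread_mutex_t probe_lock = PTHREAD_MUTEX_INITIALIZER;
static int probe_done = 0; /* set once the probe has completed */
static int probe_runs = 0; /* how many times the body actually ran */

/* Hypothetical stand-in for the get()/set() probing sequence. */
static void do_probe_unlocked(void)
{
    probe_runs++;
}

/* Run the probe at most once per process, with the whole
 * get()/set() window held under the lock. */
static void probe_once(void)
{
    pthread_mutex_lock(&probe_lock);
    if (!probe_done) {
        do_probe_unlocked();
        probe_done = 1;
    }
    pthread_mutex_unlock(&probe_lock);
}
```

With this in place a second thread entering probe_once() either waits
for the first probe to finish or finds probe_done already set and
returns without probing again.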

> If run in a batch environment that wants to set our affinity, that
> has almost certainly been done between fork() and exec() in the same
> place rlimits are "imposed" upon a job.

Jeff and I are discussing this subject (who, how and when to set
affinity) on the SGE developers list, although we haven't yet touched
on implementation issues:

http://gridengine.sunsource.net/servlets/ReadMsg?list=dev&msgNo=2574

The way you describe it is the most logical, but not necessarily the
most useful.

> Finally, I think that the possibility that a CPU goes offline
> between the get/set is also so small that I am OK with ignoring it.

Don't think only about physical CPUs; Open MPI could run in a virtual
machine as well.

-- 
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu_at_[hidden]