
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] processor affinity -- OpenMPI/batch system integration
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-09-30 15:23:40


Note that we would also have to modify OMPI to:

1. recognize these environment variables, and

2. use them to actually set the binding, instead of using
OMPI-internal directives

Not a big deal to do, but not something currently in the system. Since
we launch through our own daemons (something that isn't likely to
change in your time frame), these changes would be required.
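
A minimal sketch of the two changes listed above, assuming a hypothetical environment variable named `BATCH_BIND_CORES` that carries a comma-separated list of OS processor IDs, with ranges allowed (for example `0-3,8`). Neither the variable name nor the format is defined by Grid Engine or OMPI; both are made up for illustration. The binding call is `os.sched_setaffinity` from the Python standard library, which is available on Linux only.

```python
import os

ENV_VAR = "BATCH_BIND_CORES"  # hypothetical name, not defined by GE or OMPI


def parse_core_list(text):
    """Parse a string such as "0-3,8" into a sorted list of core IDs."""
    cores = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # An inclusive range such as "0-3".
            low, high = part.split("-", 1)
            cores.update(range(int(low), int(high) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def cores_from_environment(environ=os.environ):
    """Step 1: recognize the variable. Returns None when it is not set."""
    value = environ.get(ENV_VAR)
    if value is None:
        return None
    return parse_core_list(value)


def bind_current_process(cores):
    """Step 2: apply the binding (Linux only; pid 0 means this process)."""
    os.sched_setaffinity(0, cores)
```

In this sketch a launch daemon would call `cores_from_environment()` before starting each rank and fall back to its internal directives when the result is `None`.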

Otherwise, we could come up with some method by which you could
provide mapper information for us to use. While I agree with Jeff that having
you tell us which cores to use for each rank would generally be
better, it does raise issues when users want specific mapping
algorithms that you might not support. For example, we are working on
mappers that will take input from the user regarding comm topology
plus system info on network wiring topology and generate a
near-optimal mapping of ranks. As part of that, users may request some
number of cores be reserved for that rank for threading or other
purposes.

So perhaps both options would be best - give us the list of cores
available to us so we can map and do affinity, and pass in your own
mapping. Maybe with some logic so we can decide which to use based on
whether OMPI or GE did the mapping??
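
One way to picture the "both options" idea, purely as a sketch: the batch system hands over the list of available cores and, optionally, its own rank-to-core mapping, and a small piece of logic decides which one to use. Every name here is hypothetical, and the round-robin mapper is only a stand-in for whatever mapper OMPI would actually run.

```python
def round_robin_map(num_ranks, available_cores):
    """Stand-in for an OMPI-side mapper: assign ranks to cores in turn."""
    if not available_cores:
        raise ValueError("no cores available for mapping")
    return {rank: available_cores[rank % len(available_cores)]
            for rank in range(num_ranks)}


def choose_mapping(num_ranks, available_cores, batch_mapping=None,
                   user_requested_mapper=False):
    """Return (source, mapping), where source says who did the mapping.

    The batch system's mapping is used only when it was supplied, the user
    did not ask for a specific OMPI mapper, and it covers every rank using
    cores from the available list. Otherwise OMPI maps over that list.
    """
    allowed = set(available_cores)
    if batch_mapping is not None and not user_requested_mapper:
        covers_all_ranks = set(batch_mapping) == set(range(num_ranks))
        uses_allowed_cores = set(batch_mapping.values()) <= allowed
        if covers_all_ranks and uses_allowed_cores:
            return "batch", dict(batch_mapping)
    return "ompi", round_robin_map(num_ranks, available_cores)
```

The returned `source` string is what would let later stages tell whether OMPI or the batch system did the mapping.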

Not sure here - just thinking out loud.
Ralph

On Sep 30, 2008, at 12:58 PM, Jeff Squyres wrote:

> On Sep 30, 2008, at 2:51 PM, Rayson Ho wrote:
>
>> Restarting this discussion. A new update version of Grid Engine 6.2
>> will come out early next year [1], and I really hope that we can get
>> at least the interface defined.
>
> Great!
>
>> At the minimum, is it enough for the batch system to tell OpenMPI via
>> an env variable which core (or virtual core, in the SMT case) to
>> start binding the first MPI task?? I guess an added bonus would be
>> information about the number of processors to skip (the stride)
>> between the sibling tasks?? Stride of one is usually the case, but
>> something larger than one would allow the batch system to control the
>> level of cache and memory bandwidth sharing between the MPI tasks...
>
> Wouldn't it be better to give us a specific list of cores to bind
> to? As core counts go up in servers, I think we may see a
> re-emergence of having multiple MPI jobs on a single server. And as
> core counts go even *higher*, then fragmentation of available cores
> over time is possible/likely.
>
> Would you be giving us a list of *relative* cores to bind to (i.e.,
> "bind to the Nth online core on the machine" -- which may be
> different than the OS's ID for that processor) or will you be giving
> us the actual OS virtual processor ID(s) to bind to?
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
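
The two schemes raised in the quoted messages above can be sketched as follows: a start core plus stride expanded into one core per task, and a list of *relative* core indices translated into OS processor IDs. Both function names are hypothetical, and the list of online OS IDs is assumed to be supplied by the caller rather than discovered here.

```python
def cores_from_stride(start_core, stride, num_tasks):
    """Expand a start core and stride into one core per sibling task."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return [start_core + i * stride for i in range(num_tasks)]


def relative_to_os_ids(relative_indices, online_os_ids):
    """Translate "Nth online core" indices (0-based) into OS processor IDs."""
    ordered = sorted(online_os_ids)
    return [ordered[i] for i in relative_indices]
```

For example, if the online processors have OS IDs 0, 1, 4 and 5, relative index 2 refers to OS ID 4, which is why the two numbering schemes need to be distinguished.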