Did a little digging into this last night, and finally figured out what you were getting at in your comments here. Yeah, I think an "affinity" framework would definitely be the best approach - it could handle both CPU and memory affinity, I imagine. It isn't clear how pressing this is, since it's mostly an optimization issue, but you're welcome to create the framework if you like.


On Sun, 2005-07-17 at 09:13, Jeff Squyres wrote:
It needs to be done in the launched process itself.  So we'd either 
have to extend rmaps (from my understanding of rmaps, that doesn't seem 
like a good idea), or do something different.

Perhaps the easiest thing to do is to add this to the LANL meeting 
agenda...?  Then we can have a whiteboard to discuss.  :-)



On Jul 17, 2005, at 10:26 AM, Ralph Castain wrote:

> Wouldn't it belong in the rmaps framework? That's where we tell the
> launcher where to put each process - seems like a natural fit.
>
>
> On Jul 17, 2005, at 6:45 AM, Jeff Squyres wrote:
>
>> I'm thinking that we should add some processor affinity code to OMPI 
>> --
>> possibly in the orte layer (ORTE is the interface to the back-end
>> launcher, after all).  This will really help on systems like Opterons
>> (and others) to prevent processes from bouncing between processors, 
>> and
>> potentially getting located far from "their" RAM.
>>
>> This has the potential to help even micro-benchmark results (e.g.,
>> ping-pong).  It's going to be quite relevant for my shared memory
>> collective work on mauve.
>>
>>
>> General scheme:
>> ---------------
>>
>> I think that somewhere in ORTE, we should actively set processor
>> affinity when:
>>    - Supported by the OS
>>    - Not disabled by the user (via MCA param)
>>    - The node is not over-subscribed with processes from this job
>>
>> Generally speaking, if you launch <=N processes in a job on a node
>> (where N == number of CPUs on that node), then we set processor
>> affinity.  We set each process's affinity to the CPU number according
>> to the VPID ordering of the procs in that job on that node.  So if you
>> launch VPIDs 5, 6, 7, 8 on a node, 5 would go to processor 0, 6 would
>> go to processor 1, etc. (it's an easy, locally-determined ordering).
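>>
>> Roughly, I'm picturing something like this (an untested sketch against
>> Linux's sched_setaffinity(2); local_rank and num_local_procs are
>> made-up names for the process's position in the per-node VPID ordering
>> and the number of procs this job has on the node):
>>
>>    #define _GNU_SOURCE
>>    #include <sched.h>      /* sched_setaffinity(), cpu_set_t */
>>    #include <unistd.h>     /* sysconf() */
>>
>>    /* Bind the calling process to one CPU chosen by its local rank.
>>       Returns 0 on success, -1 if we skipped or failed the binding. */
>>    static int bind_self_to_cpu(int local_rank, int num_local_procs)
>>    {
>>        long num_cpus = sysconf(_SC_NPROCESSORS_ONLN);
>>        cpu_set_t mask;
>>
>>        /* Skip binding if the node is over-subscribed by this job
>>           (the real code would also check the MCA disable param). */
>>        if (num_cpus <= 0 || num_local_procs > num_cpus) {
>>            return -1;
>>        }
>>
>>        CPU_ZERO(&mask);
>>        CPU_SET(local_rank % num_cpus, &mask);
>>
>>        /* pid 0 means "the calling process" */
>>        return sched_setaffinity(0, sizeof(mask), &mask);
>>    }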
>>
>> Someday, we might want to make this scheme universe-aware (i.e., see 
>> if
>> any other ORTE jobs are running on that node, and not schedule on any
>> processors that are already claimed by the processes on that(those)
>> job(s)), but I think single-job awareness is sufficient for the 
>> moment.
>>
>>
>> Implementation:
>> ---------------
>>
>> We'll need relevant configure tests to figure out if the target system
>> has CPU affinity system calls.  Those are simple to add.
>>
>> We could simply use #if statements for the affinity stuff, or make it a
>> real framework.  Since it's only 1 function call to set the affinity, 
>> I
>> tend to lean towards the [simpler] #if solution, but could probably be
>> pretty easily convinced that a framework is the Right solution.  I'm 
>> on
>> the fence (and if someone convinces me, I'd volunteer for the extra
>> work to set up the framework).
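>>
>> For the record, the #if version would amount to something like the
>> sketch below (untested; it assumes a stock AC_CHECK_FUNCS-style
>> configure test that defines HAVE_SCHED_SETAFFINITY in the generated
>> config header, and only handles the Linux call so far):
>>
>>    #define _GNU_SOURCE              /* cpu_set_t / CPU_SET on Linux */
>>    /* HAVE_SCHED_SETAFFINITY would come from the configure-generated
>>       config header, included before this point */
>>    #ifdef HAVE_SCHED_SETAFFINITY
>>    #include <sched.h>
>>    #endif
>>
>>    static int set_affinity(int cpu)
>>    {
>>    #ifdef HAVE_SCHED_SETAFFINITY
>>        cpu_set_t mask;
>>        CPU_ZERO(&mask);
>>        CPU_SET(cpu, &mask);
>>        return sched_setaffinity(0, sizeof(mask), &mask);
>>    #else
>>        /* No affinity support found at configure time: quietly do
>>           nothing rather than failing the job. */
>>        (void) cpu;
>>        return 0;
>>    #endif
>>    }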
>>
>> I'm not super-familiar with the processor-affinity stuff (e.g., for
>> best effect, should it be done after the fork and before the exec?), 
>> so
>> I'm not sure exactly where this would go in ORTE.  Potentially either
>> before new processes are exec'd (where we only have control of that in
>> some kinds of systems, like rsh/ssh) or right up very very near the 
>> top
>> of orte_init().
>>
>> Comments?
>>
>> -- 
>> {+} Jeff Squyres
>> {+} The Open MPI Project
>> {+} http://www.open-mpi.org/
>>
>