
Open MPI Development Mailing List Archives



This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.


Subject: Re: [OMPI devel] RTE node allocation component
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-04-14 17:36:46

The 1.6 branch is a stable series - no new features will be added to it, so your patch won't be going there. I'd focus solely on the trunk.

What you're doing with the RAS is fine for now. In the next few days, I'll be changing the API to the RAS components, but it isn't a big change and we can adjust as you get closer. The orte_job_t object does contain the number of procs to be launched before the RAS is invoked, but you have to compute it: each app_context contains that number, so to get it for the job you cycle across all the app_contexts and add them up.
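The accumulation described above can be sketched as follows. This is a self-contained model, not the actual ORTE code: the struct and field names here only mimic the relevant part of the real orte_app_context_t.

```c
#include <stddef.h>

/* Stand-in for orte_app_context_t: only the one field needed here.
 * The real ORTE struct has many more members. */
typedef struct {
    int num_procs;   /* procs requested by this app_context (the -n value) */
} app_context_t;

/* Sum the requested procs across a job's app_contexts. NULL entries
 * are skipped, mirroring how ORTE's pointer arrays can hold unused slots. */
int total_requested_procs(app_context_t **apps, size_t size)
{
    int total = 0;
    for (size_t i = 0; i < size; i++) {
        if (NULL == apps[i]) {
            continue;
        }
        total += apps[i]->num_procs;
    }
    return total;
}
```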

The mapper assigns the final num_procs value in the orte_job_t object. We do this because the user can also run the job without specifying the number of procs, and we'll simply run one proc for every allocated slot. It's a popular option, but wouldn't work here for obvious reasons.
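The fallback behavior described above might look roughly like this; a hypothetical sketch (an unset proc count is modeled as 0), not the actual mapper code.

```c
/* Hypothetical sketch: pick the job's final proc count. If the user
 * gave no explicit count (modeled here as 0), run one proc for every
 * allocated slot; otherwise honor the requested count. */
int final_num_procs(int requested_procs, int allocated_slots)
{
    if (0 == requested_procs) {
        return allocated_slots;   /* no -n given: one proc per slot */
    }
    return requested_procs;       /* an explicit -n wins */
}
```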

On Apr 14, 2012, at 2:55 PM, Alex Margolin wrote:

> As to the old version: I'm working in parallel on a patch to branch 1.6 and the trunk, which (the patches, not the versions) are almost identical.
> There is a minor difference in my patch for the RAS: in the trunk I used the preexisting total_slots_alloc, while in 1.6 I added it to orte_ras_base (exactly where it is located in the trunk). I admit this isn't the original intent of the orte_ras_base data struct specifically, or maybe even of the RAS component in general, but I see no other way to implement it now...
> What I've written for RAS (attached is my current patch for the 1.6 branch, incl. BTL and ODLS modules previously sent here) is a module which does 2 things (for mpirun -n X foo):
> 1. Waits for X slots to become available somewhere in the cluster (optional)
> 2. Creates the allocation composed of the X best machines to use
> - This requires the RAS module to know the number of slots to allocate in advance... is there a better way to do it? (in 1.6/trunk?)
> I tried to access the orte_job_t struct using my jobid from inside the ras module, but that struct isn't initialized with content at that time.
> Thanks,
> Alex
> P.S. I'm preparing a patch for both the 1.6 branch and the trunk because I want to do some benchmarking (I've seen it noted that the trunk is bad for this purpose) and I want it to be available in the long run. Am I missing something here? I hope I'll get the contributor paper signed so I can commit rather than working on my laptop...
> On 04/13/2012 07:43 PM, Ralph Castain wrote:
>> Looks like you are using an old version - the trunk RAS has changed a bit. I'll shortly be implementing further changes to support dynamic allocation requests that might be relevant here as well.
>> Adding job data to the RAS base isn't a good idea - remember, multiple jobs can be launching at the same time!
>> On Apr 13, 2012, at 10:07 AM, Alex Margolin wrote:
>>> Hi,
>>> The next component I'm writing is a component for allocating nodes to
>>> run the processes of an MPI job.
>>> Suppose I have a "getbestnode" executable which not only tells me the
>>> best location for spawning a new process,
>>> but it also reserves the space (for some time), so that every time I run
>>> it I get different results (as the best cores are already reserved).
>>> I thought I should write a component under orte/mca/ras, similar to
>>> loadleveler, but the problem is that I can't determine inside the module
>>> the number of slots required to allocate. It gets a list to fill in as a parameter, and
>>> I guess it assumes I somehow know how many processes are run because the
>>> allocation was done externally and now I'm just asking the allocator for
>>> the list.
>>> A related location, the rmaps, has this information (and much more), but
>>> it doesn't look like a good location for such a module since it maps
>>> already allocated resources, and has a lot of irrelevant code in this case.
>>> Maybe the answer is to change the base module a bit, to contain this
>>> information? It could be used as a decent sanity check for other modules
>>> - making sure the external allocation fits the number of processes we
>>> intend to run. Maybe orte_ras_base_allocate(orte_job_t *jdata) in
>>> ras_base_allocate.c can store the relevant information from jdata in
>>> orte_ras_base? In the long run it can become a parameter passed to the
>>> ras components, but for backward compatibility the global will do for now.
>>> Thanks,
>>> Alex
>>> P.S. An RDS component is elaborately mentioned in ras.h, yet it is no
>>> longer available, right?
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
> <patch-openmpi-1.6>