Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] orte_odls_base_default_launch_local()
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-07-14 10:48:03


Sorry for the delay - got tied up. This looks sane to me and should work.

FWIW: I haven't seen any problem with comm_spawn_multiple. That code will only executes if the specific limit is hit, so I suspect that is an issue of scale and race conditions.

On Jul 12, 2011, at 6:44 PM, Eugene Loh wrote:

> Thanks for the quick and helpful response.
>
> I'm willing to make the changes if only I were more confident I understood the issues.
>
> What you call the first usage of num_procs_alive seems to count "local procs alive or about to be launched". It tests on "child->alive || opal_dss.compare(&job, &(child->name->jobid), ORTE_JOBID))". This is only in the !oversubscribed case.
>
> What you call "the later usage" ("0 < opal_sys_limits.num_procs" or "check the system limits") seems to want only the number of alive procs (not including any that are about to be launched). So, as I guess you were driving at, a fresh computation of num_procs_alive seems to be needed at this point, both in v1.5 and in the trunk.
>
> There is actually a third usage of num_procs_alive ("available file descriptors" or "0 < opal_sys_limits.num_files"). I think there are two problems here. One is that the number of procs we're about to launch is not accounted for. Secondly, and perhaps most importantly, if we "wait a little time" and recompute num_procs_alive, we loop over item/child. This is bad since we are already inside an item/child loop! (This is actually the origin of my foray into this mess... MPI_Comm_spawn_multiple gets screwed up by this reuse of variable names.)
>
> I'm attaching an original (trunk) file, a proposed version, and the diff file. Let me know if it looks sane. If so, I'll try it out and put it back and then file a CMR for 1.5.
>
> Some more below...
>
> On 07/12/11 18:19, Ralph Castain wrote:
>
>> Well, yes and no.
>>
>> Yes - the value definitely needs to be computed in all cases since it is used later on.
>>
>> No - the way it is computed is only correct for that one usage. The later usage (in the block that starts with 0< opal_sys_limits.num_procs)) needs a completely different value.
>>
>> The current computation only comes up with the number of procs for the specific job being launched. The latter code requires the total number of local procs that are alive - a rather different value.
>
> The first usage, to compare with num_processors, includes not only procs for the job being launched but also already-alive procs.
>
> The second usage, to be compared to opal_sys_limits.num_procs, should only have alive procs since app->num_procs will be added later. (Plus, we try to keep a running total of this number of alive procs.)
>
> The third usage, to be compared to opal_sys_limits.num_files, should also only have alive procs for the same reasons.
>
>> So v1.5 is incorrect in either case. Would you like me to create a patch to correct it, or would you like to do it (covering both cases)?
>
> Trunk also appears to be incorrect.
>
>> On Jul 12, 2011, at 3:35 PM, Eugene Loh wrote:
>>> The function orte_odls_base_default_launch_local() has a variable num_procs_alive that is basically initialized like this:
>>>
>>> if ( oversubscribed ) {
>>> ...
>>> } else {
>>> num_procs_alive = ...;
>>> }
>>>
>>> Specifically, if the "oversubscribed" test passes, the variable is not initialized.
>>>
>>> (Strictly speaking, this is true only in v1.5. In the trunk, the variable is set to 0 when it is declared, but I'm not sure that's very helpful.)
>>>
>>> I'm inclined to move the num_procs_alive computation ahead of the "if" block so that this computation is always performed.
> <files.tar.gz>_______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel