On 8/30/2011 7:34 PM, Ralph Castain wrote:
> On Aug 29, 2011, at 11:18 PM, Eugene Loh wrote:
>> Maybe someone can help me from having to think too hard.
>> Let's say I want to max my system limits. I can say this:
>> % mpirun --mca opal_set_max_sys_limits 1 ...
>> Meanwhile, if I do this:
>> % setenv OMPI_MCA_opal_set_max_sys_limits 1
>> % mpirun ...
>> remote processes don't see the setting. (Local processes and ompi_info are fine.)
> I looked at the 1.5 code, and mpirun is reaping all OMPI_ params from the environ and adding them to the app. So it should be getting set.
> I then ran "mpirun -n 1 printenv" on a slurm machine, and verified that indeed that param was in the environment. Ditto when I told it to use the rsh launcher.
>> Bug? Naively, this looks "wrong." At least disturbing, in any case.
>> This is with v1.5.
Okay, so one answer is implicit in your reply: you are expecting the
same result I am. So, if the behavior is not as I expect but as I
describe, it's a bug candidate. (As opposed to, "The problem you're
describing is how it's supposed to work; it's no problem at all.")
Now, regarding "mpirun -n 1 printenv", I agree that the environment
variable is getting set. Even on a remote node. That suggests that
things are fine, but it turns out they are not. The problem is -- and
I'm afraid I don't understand the details -- it's set "too late." I
imagine a time line like this:
A) orted starts
B) orted calls opal_util_init_sys_limits()
C) daemonize a child process
D) child process execs target process
E) target process starts up
Looking at the environment, I don't see the variable set in B, which is
the only place the variable does any good. Like you, I do see it in E,
which is interesting but doesn't help the user.
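The suspected ordering can be mimicked with a toy sketch in Python. This
is not Open MPI code: simulate_timeline() and its argument are
hypothetical names, and the assumption that the variable is added only to
the environment handed to the child is just the guess from the timeline
above, not something confirmed in the source.

```python
import os
import subprocess
import sys

PARAM = "OMPI_MCA_opal_set_max_sys_limits"


def simulate_timeline(daemon_env):
    """Toy model of steps A-E; returns (seen_at_B, seen_at_E)."""
    # A/B) The "daemon" consults its own environment when it would set limits.
    seen_at_B = PARAM in daemon_env

    # C/D) Assumption: the variable is added only to the child's environment.
    child_env = dict(daemon_env)
    child_env[PARAM] = "1"

    # E) The target process reports whether it sees the variable.
    probe = "import os; print(os.environ.get(%r, 'unset'))" % PARAM
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    seen_at_E = result.stdout.strip() == "1"
    return seen_at_B, seen_at_E


if __name__ == "__main__":
    # Start from an environment that lacks the variable, like the remote orted.
    env = {k: v for k, v in os.environ.items() if k != PARAM}
    print(simulate_timeline(env))
```

In this model the variable is visible at E but not at B, which matches
what I observe: printenv in the target process looks fine even though the
limits were never raised.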
Your experiment was reasonable, but the problem is odd. I suggest the
following to see the problem. Set the variable in your environment.
Then use mpirun to launch a remote process. Then:
1) In the remote orted, inside opal_util_init_sys_limits(), check for
the variable in your environment.
2) Make the remotely launched process one that reports its descriptor
limit, and see if that limit got bumped up from what it otherwise
would be.
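For step 2, one possible test program is sketched below in Python,
assuming a Python 3 interpreter is available on the remote node. The
script name check_limits.py and the function describe_limits() are made
up for illustration; only the standard resource and os modules are used.

```python
import os
import resource

PARAM = "OMPI_MCA_opal_set_max_sys_limits"


def describe_limits():
    """Return the MCA variable (if any) and the open-file descriptor limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "param": os.environ.get(PARAM, "unset"),
        "nofile_soft": soft,
        "nofile_hard": hard,
    }


if __name__ == "__main__":
    info = describe_limits()
    # Print one line per process so runs with and without the setting
    # can be compared side by side.
    print("%s=%s nofile soft=%d hard=%d" % (
        PARAM, info["param"], info["nofile_soft"], info["nofile_hard"]))
```

Launch it on a remote node with something like "mpirun -n 1 -host
<remote-node> python3 check_limits.py", once with the variable set in the
environment and once with "--mca opal_set_max_sys_limits 1" on the command
line, and compare the soft limit reported in the two runs.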
In contrast, if you set the MCA parameter on your mpirun command line,
the environment variable *does* get set, even in the environment of the
orted when it calls opal_util_init_sys_limits().
I can poke at this more tomorrow, but I suspect with one "aha!" you'll
figure it out a lot faster than I can. :^(