I guess I wasn't clear earlier - I don't know anything about how HP-MPI works. I was only theorizing that perhaps they did something different that results in some other slurm vars showing up in Brent's tests. From Brent's comments, I guess they don't - but they launch jobs in a different manner that results in some difference in the slurm envars seen by application procs.
On Feb 24, 2011, at 2:59 PM, Henderson, Brent wrote:
> They really can't be all SLURM_PROCID=0 - that is supposed to be unique for the job - right? It appears that the SLURM_PROCID is inherited from the orted parent - which makes a fair amount of sense given how things are launched.

That's correct, and I can agree with your sentiment.
However, our design goals were to provide a consistent *Open MPI* experience across different launchers. Providing native access to the actual underlying launcher was a secondary goal. Balancing those two, you can see why we chose the model we did: our orted provides (nearly) the same functionality across all environments.
In SLURM's case, we propagate seemingly nonsensical SLURM_PROCID values to the individual processes - but they only appear nonsensical if you assume that Open MPI uses SLURM's launcher to start each MPI process individually.
More specifically, our goal is to provide consistent *Open MPI information* (e.g., through the OMPI_COMM_WORLD* env variables) -- not emulate what SLURM would have done if MPI processes had been launched individually through srun. Even more specifically: we don't think that the exact underlying launching mechanism that OMPI uses is of interest to most users; we encourage them to use our portable mechanisms that work even if they move to another cluster with a different launcher. Admittedly, that does make it a little more challenging if you have to support multiple MPI implementations, and although that's an important consideration to us, it's not our first priority.
> Now to answer the other question - why are there some variables missing. It appears that when the orted processes are launched - via srun but only one per node, it is a subset of the main allocation and thus some of the environment variables are not the same (or missing entirely) as compared to launching them directly with srun on the full allocation. This also makes sense to me at some level, so I'm at peace with it now. :)
No worries; you were perfectly clear. Thanks!
> Last thing before I go. Please let me apologize for not being clear on what I disagreed with Ralph about in my last note. Clearly he nailed the orted launching process and spelled it out very clearly, but I don't believe that HP-MPI is doing anything special to copy/fix up the SLURM environment variables. Hopefully that was clear by the body of that message.