Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] alps ras patch for SLURM
From: Jerome Soumagne (soumagne_at_[hidden])
Date: 2010-07-09 12:53:44


I would prefer the first patch, though, so that we get rid of the script and
of the extra env variable, but I'll let you choose.

Jerome

On 07/09/2010 06:27 PM, Jerome Soumagne wrote:
> Hi Ken,
>
> That's interesting. Setting OMPI_ALPS_RESID in the modules environment by
> running ras-alps-command.sh is a good idea. In that case, another option
> would be to add a check for BASIL_RESERVATION_ID to this script, as you
> did for BATCH_PARTITION_ID.
> Here is another possible patch:
>
> Index: ras-alps-command.sh
> ===================================================================
> --- ras-alps-command.sh (revision 23365)
> +++ ras-alps-command.sh (working copy)
> @@ -22,6 +22,13 @@
> exit 0
> fi
>
> + # If the SLURM BASIL_RESERVATION_ID is set, use it.
> + if [ "${BASIL_RESERVATION_ID}" != "" ]
> + then
> + ${ECHO} ${BASIL_RESERVATION_ID}
> + exit 0
> + fi
> +
> # Extract the batch job ID directly from the environment, if available.
> jid=${BATCH_JOBID:--1}
> if [ $jid -eq -1 ]
>
>
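The lookup order this patch aims for can be sketched as a small shell function. This is an illustration only, not the actual ras-alps-command.sh: the function name `resolve_resid` is hypothetical, and the sketch assumes that the check just before the patched hunk is the BATCH_PARTITION_ID one mentioned above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Open MPI: print the first reservation ID
# found in the environment, checking the candidate variables in order.
resolve_resid() {
    for name in BATCH_PARTITION_ID BASIL_RESERVATION_ID; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        if [ -n "${value}" ]; then
            printf '%s\n' "${value}"
            return 0
        fi
    done
    # Nothing usable in the environment. Per the diff context, the real
    # script goes on to look at the batch job ID at this point.
    return 1
}
```

A caller could use it as `resid=$(resolve_resid) || echo "no reservation ID in the environment"`.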
> Thanks for the clarification.
>
> Jerome
>
> On 07/09/2010 05:41 PM, Matney Sr, Kenneth D. wrote:
>> Hi Jerome,
>>
>> I am in part responsible for the current incarnation of the ALPS support in OMPI. We use the
>> modules environment to set OMPI_ALPS_RESID to the ALPS reservation ID. The pertinent
>> parts of the modulefile are:
>>
>> set ridpath ${basedir}/share/openmpi
>> set ridname ras-alps-command.sh
>> set rid ${ridpath}/${ridname}
>>
>> # Set local cluster parameters for XT5.
>> set resId [exec /bin/bash ${rid}]
>> setenv OMPI_ALPS_RESID $resId
>>
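For readers who do not use Tcl modulefiles, the fragment above corresponds roughly to the shell logic below. This is a sketch under assumptions: the function name `set_ompi_alps_resid` is hypothetical, the script path is passed in because its location is site-specific, and the check for empty output is an addition that the modulefile itself does not make.

```shell
#!/bin/sh
# Hypothetical shell counterpart of the modulefile fragment above.
# $1 is the path to ras-alps-command.sh (site-specific).
set_ompi_alps_resid() {
    rid_script=$1
    # Run the helper script and capture what it prints. Give up if it fails
    # or prints nothing, leaving the environment untouched.
    res_id=$(bash "${rid_script}") || return 1
    [ -n "${res_id}" ] || return 1
    export OMPI_ALPS_RESID="${res_id}"
}
```

For the export to affect the current session, the function has to be called from the shell that later runs mpirun (for example from a sourced profile script).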
>> Originally, the Cray XT systems automatically set an environment variable, BATCH_PARTITION_ID,
>> to the ALPS reservation ID for the job. However, newer versions do not expose the ALPS reservation
>> ID to the user. So, we need a way to get the ALPS reservation ID of the Torque job. Unfortunately,
>> Cray has not made available the internal structure of ALPS that does this. So, we are forced to use
>> apstat to get this information. But apstat is not as robust as we might like. Ergo, the script
>> loops on apstat until it does not fail. In the end, we obtain the ALPS reservation ID for the current
>> Torque job and assign it to OMPI_ALPS_RESID. I chose this name so as to avoid namespace conflicts.
>>
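The "loop until it does not fail" idea described above can be sketched generically as follows. This is not the code from ras-alps-command.sh: the helper name `retry`, the attempt limit, and the delay are all assumptions, and the command to run is left to the caller so that no particular apstat invocation is implied.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or the attempt
# budget is used up.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        # Stop once the last allowed attempt has failed.
        if [ "${attempt}" -ge "${max_attempts}" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${delay}"
    done
}
```

Bounding the number of attempts is a design choice made for this sketch so that a permanently failing query cannot hang a login shell. The message above does not say whether the real script is bounded.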
>> So, the ALPS RAS mca is being selected because your patch tells it that
>> BASIL_RESERVATION_ID is equivalent to OMPI_ALPS_RESID. In turn, when you invoke OMPI with
>> mpirun, the OMPI version of mpirun will select the ALPS PLM mca. This will launch your job with
>> aprun (under the covers). So, your job does show a successful run. However, you may not be taking
>> the path through mpirun that you intended.
>>
>> I do hope that I have cleared up some confusion.
>> --
>> Ken Matney, Sr.
>> Oak Ridge National Laboratory
>>
>>
>> On Jul 9, 2010, at 6:19 AM, Jerome Soumagne wrote:
>>
>> Hi,
>>
>> We've recently installed Open MPI on one of our Cray XT5 machines here at CSCS. This machine uses SLURM for launching jobs.
>> Doing an salloc defines this environment variable:
>> BASIL_RESERVATION_ID
>> The reservation ID on Cray systems running ALPS/BASIL only.
>>
>> Since the alps ras module tries to find a variable called OMPI_ALPS_RESID which is set using a script, we thought that for SLURM systems it would be a good idea to directly integrate this BASIL_RESERVATION_ID variable in the code, rather than using a script. The small patch is attached.
>>
>> Regards,
>>
>> Jerome
>>
>> --
>> Jérôme Soumagne
>> Scientific Computing Research Group
>> CSCS, Swiss National Supercomputing Centre
>> Galleria 2, Via Cantonale | Tel: +41 (0)91 610 8258
>> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>>
>>
>>
>> <patch_slurm_alps.txt><ATT00001..txt>
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>