Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] alps ras patch for SLURM
From: Jerome Soumagne (soumagne_at_[hidden])
Date: 2010-07-09 12:27:20


Hi Ken,

That's interesting, setting the OMPI_ALPS_RESID in the modules so that
it executes the ras-alps-command.sh is a good idea. In this case another
way would be to add an extra line in this script with the
BASIL_RESERVATION_ID as you did for the BATCH_PARTITION_ID.
I have another possible patch then:

Index: ras-alps-command.sh
===================================================================
--- ras-alps-command.sh (revision 23365)
+++ ras-alps-command.sh (working copy)
@@ -22,6 +22,13 @@
      exit 0
    fi

+ # If the SLURM BASIL_RESERVATION_ID is set, use it.
+ if [ "${BASIL_RESERVATION_ID}" != "" ]
+ then
+ ${ECHO} ${BASIL_RESERVATION_ID}
+ exit 0
+ fi
+
  # Extract the batch job ID directly from the environment, if available.
    jid=${BATCH_JOBID:--1}
    if [ $jid -eq -1 ]

Thanks for your help in the clarification.

Jerome

On 07/09/2010 05:41 PM, Matney Sr, Kenneth D. wrote:
> Hi Jerome,
>
> I am in part responsible for the current incarnation of the ALPS support in OMPI. We use the
> modules environment to set OMPI_ALPS_RESID to the ALPS reservation ID, the pertinent
> parts of which are:
>
> set ridpath ${basedir}/share/openmpi
> set ridname ras-alps-command.sh
> set rid ${ridpath}/${ridname}
>
> # Set local cluster parameters for XT5.
> set resId [exec /bin/bash ${rid}]
> setenv OMPI_ALPS_RESID $resId
>
> Originally, the Cray XT systems automatically set an environmental variable, BATCH_PARTITION_ID
> to the ALPS reservation ID for the job. However, newer versions do not expose the ALPS reservation
> ID to the user. So, we need a way to get the ALPS reservation ID of the Torque job. Unfortunately,
> Cray has not made the internal structure of ALPS that does this available. So, we are forced to use
> apstat to get this information. But, apstat is not as robust as we might like. Ergo, the script is used to
> loop on apstat until it does not fail. In the end, we obtain the ALPS reservation ID for the current
> Torque job and set it to OMPI_ALPS_RESID. I chose this name so as to avoid namespace conflicts.
>
> So, the ALPS RAS mca is being selected, because your patch tells the ALPS RAS mca that
> BASIL_RESERVATION_ID is equivalent to OMPI_ALPS_RESID. In turn, while you invoke OMPI with
> mpirun, the OMPI version of mpirun will select the ALPS PLM mca. This will launch your job with an
> aprun (under the covers). So, your job does show a successful run. However, you may not be taking
> the path through mpirun that you intended.
>
> I do hope that I have cleared up some confusion.
> --
> Ken Matney, Sr.
> Oak Ridge National Laboratory
>
>
> On Jul 9, 2010, at 6:19 AM, Jerome Soumagne wrote:
>
> Hi,
>
> We've recently installed OpenMPI on one of our Cray XT5 machines, here at CSCS. This machine uses SLURM for launching jobs.
> Doing an salloc defines this environment variable:
> BASIL_RESERVATION_ID
> The reservation ID on Cray systems running ALPS/BASIL only.
>
> Since the alps ras module tries to find a variable called OMPI_ALPS_RESID which is set using a script, we thought that for SLURM systems it would be a good idea to directly integrate this BASIL_RESERVATION_ID variable in the code, rather than using a script. The small patch is attached.
>
> Regards,
>
> Jerome
>
> --
> Jérôme Soumagne
> Scientific Computing Research Group
> CSCS, Swiss National Supercomputing Centre
> Galleria 2, Via Cantonale | Tel: +41 (0)91 610 8258
> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>
>
>
> <patch_slurm_alps.txt><ATT00001..txt>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>