> there appear to be some overlaps between the ls_* and lsb_* functions,
> but they seem basically compatible as far as i can tell. almost all
> the functions have a command line version as well, for example:
Like Open MPI and ORTE, there are two layers in LSF. The ls_* APIs
talk to what is/was historically called "LSF Base", and the lsb_* APIs
talk to what is/was historically called "LSF Batch".
The ls_* APIs are essentially "do it now" functionality for
writing distributed applications that do not require batch scheduling;
the ls_* functions do not honour any batch allocation or policy.
> lsb_getalloc()/none and lsb_launch()/blaunch are new with LSF 7.0, but
> appear to just be a different (simpler) interface to existing
> functionality in the LSB_* env vars and the ls_rexec()/lsgrun commands
> -- although, as you say, perhaps platform will hook or enhance them
> later. but, the key issue is that lsb_launch() just starts tasks -- it
> does not perform or interact with the queue or job control (much?).
> so, you can't use these functions to get an allocation in the first
> place, and you have to be careful not to use them as a way around the
> queuing system.
The ls_* APIs do not honour a batch allocation, while lsb_launch does:
lsb_launch will only allow you to start tasks on nodes allocated to
your job, and is subject to all the queue/job controls.
ls_rexec/lsgrun are not used to start batch jobs.
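As a sketch of the allocation-honouring path, here is how launching looks with blaunch, the LSF 7.0 command-line counterpart of lsb_launch (the `./worker` binary and the 4-slot request are placeholders):

```shell
# Submit a parallel job; the job script runs on the first allocated host:
#   bsub -n 4 ./jobscript.sh

# jobscript.sh -- ./worker stands in for your task binary.
# blaunch consults the job's recorded allocation (LSB_MCPU_HOSTS /
# LSB_DJOB_HOSTFILE) and will refuse to start tasks on hosts outside it.
blaunch ./worker              # start one task per allocated slot
blaunch "$HOSTNAME" ./worker  # or target a single host from the allocation
```

By contrast, lsrun/lsgrun would start tasks on any host they can reach, which is exactly the behaviour that bypasses the batch allocation.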
In pre-7.0 LSF, the method for starting Open MPI jobs is essentially:
$ bsub -n N -a openmpi mpirun.lsf a.out
Note that you only have the openmpi method and mpirun.lsf if you have
installed the HPC extensions.
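The allocation that mpirun.lsf (or any launcher) consumes is published to the job's environment. A quick way to inspect it from inside a running job (variable names are standard LSF; exact formatting varies by version):

```shell
# Run inside a batch job, e.g.:  bsub -n 4 ./show_alloc.sh
echo "LSB_JOBID=$LSB_JOBID"           # job id assigned by the batch system
echo "LSB_HOSTS=$LSB_HOSTS"           # one host name per allocated slot
echo "LSB_MCPU_HOSTS=$LSB_MCPU_HOSTS" # host / slot-count pairs
```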
> [ as a side note, the function ls_rexecv()/lsgrun is the one i have
> heard admins do not like because it can break queuing/accounting, and
> might try to disable somehow. i don't really buy that, because it's
> not you can disable it and have the system still work, since (as
> above) || job launching depends on it. i guess if you really don't
> care about || launching maybe you could. but, if used properly after a
> proper allocation i don't think there should (or even can) be a
> problem. ]
Job launching does not depend on it, and admins can explicitly
turn off support for ls_rexec/lsgrun while allowing lsb_launch to
continue to function -- thus ensuring that tasks of any form can only
be started on nodes allocated to the job.
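A sketch of that configuration, assuming the site uses the documented lsf.conf switch for this (the RES then rejects lsrun/lsgrun remote-execution requests while blaunch/lsb_launch continue to work):

```shell
# lsf.conf
LSF_DISABLE_LSRUN=Y
```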
> so, lsb_submit()/bsub is a combination allocate/launch -- you specify
> the allocation size you want, and when it's all ready, it runs the
> 'job' (really the job launcher) only on one (randomly chosen) 'head'
> node from the allocation, with the env vars set so the launcher can
> use ls_rexec/lsgrun functions to start the rest of the job. there are
> of course various script wrappers you can use (mpijob, pvmjob, etc)
> instead of your 'real job'. then, i think lsf *should* try to track
> what processes get started via the wrapper / head process so it knows
> they are part of the same job. i dunno if it really does that -- but,
> my guess is that at the least it assumes the allocation is in use
> until the original process ends. in any case, the wrapper / head
> process examines the environment vars and uses ls_rexec()/lsgrun or
> the like to actually run N copies of the 'real job' executable. in
> 7.0, it can conveniently use lsb_getalloc() and lsb_launch(), but that
> doesn't really change any semantics as far as i know. one could
> imagine that calling lsb_launch() instead of ls_rexec() might be
> preferable from a process tracking point of view, but i don't see why
> Platform couldn't hook ls_rexec() just as well as lsb_launch().
ls_rexec does not honour batch semantics. Prior to LSF7 there is
an additional parallel application manager that is started when the
-a openmpi option is specified. It handles I/O marshalling, signaling
and task accounting for the complete parallel job across all nodes.
In LSF 7, this functionality has been embedded directly into the RES
and is invoked when lsb_launch is used.
Yes, you could use ls_rexec, but it does not handle the I/O and process
marshalling; you would need to handle that yourself if you use ls_rexec.
The first node is not random; it is the "best" match based on the
resource requirements for the job.
Since you are referring to the mpijob/pvmjob scripts, I would guess
you do not have the HPC extensions installed, as those are fairly
simplistic wrappers that don't make use of the parallel application
manager.
> there is also an lsb_runjob() that is similar to lsb_launch(), but for
> an already submitted job. so, if one were to lsb_submit() with an
> option set to never launch it automatically, and then one were to run
> lsb_runjob(), you can avoid the queue and/or force the use of certain
> hosts? i guess this is also not a good function to use, but at least
> the queuing system would be aware of any bad behavior (queue skipping
> via ls_placereq() to get extra hosts, for instance) in this case ...
Not really - lsb_runjob() is essentially an admin function to force
a job to run irrespective of the current scheduling policy.
Unless you have administrator privileges, it will fail.
As for growing or shrinking the allocation for a job, that is on
the roadmap for the near future. However, as Jeff has previously
mentioned, on a busy system you could end up waiting for a long time
to get additional nodes.
Essentially it boils down to making an asynchronous request for
resources and registering a callback for when something becomes available.
Principal Technical Product Manager