Open MPI Development Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-01-29 09:47:37


On 1/27/07 9:37 AM, "Greg Watson" <gwatson_at_[hidden]> wrote:

> There are two more interfaces that have changed:
>
> 1. orte_rds.query() now takes a job id, whereas in 1.2b1 it didn't
> take any arguments. I seem to remember that I call this to kick orted
> into action, but I'm not sure of the implications of not calling it.
> In any case, I don't have a job id when I call it, so what do I pass
> to get the old behavior?

For now, you can just use ORTE_JOBID_INVALID (defined in
orte/mca/ns/ns_types.h).

However, your question raises a flag. You should be calling
orte_rmgr.setup_job before you call the RDS, and that function returns the
jobid for your job. Failing to call setup_job first may cause other parts of
the code base to fail, as they expect setup_job to have set up certain data
in the registry.

If you do call setup_job first, then just pass the returned jobid along to
rds.query.
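
The ordering described above can be sketched in isolation. This is only an
illustration of the call order: every mock_* name, the jobid type, and the
function signatures are hypothetical stand-ins, not the real ORTE API, whose
actual argument lists are not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: NOT the real ORTE types or signatures. */
typedef uint32_t mock_jobid_t;
#define MOCK_JOBID_INVALID ((mock_jobid_t)UINT32_MAX)

static mock_jobid_t next_jobid = 1;                    /* next id to hand out */
static mock_jobid_t last_queried = MOCK_JOBID_INVALID; /* last id seen by query */

/* Plays the role of orte_rmgr.setup_job: allocates a jobid and returns it
 * through the out-parameter. */
static int mock_setup_job(mock_jobid_t *jobid)
{
    assert(jobid != NULL);
    *jobid = next_jobid++;
    return 0;
}

/* Plays the role of orte_rds.query: records the jobid it was given. */
static int mock_rds_query(mock_jobid_t jobid)
{
    last_queried = jobid;
    return 0;
}

/* Recommended order: set up the job first, then hand its jobid to the query. */
static int prepare_job(mock_jobid_t *jobid)
{
    int rc = mock_setup_job(jobid);
    if (0 != rc) {
        return rc;
    }
    return mock_rds_query(*jobid);
}
```

The point of the sketch is simply that the query never sees an invalid jobid
when setup runs first; the invalid value is only the fallback for callers that
have no job yet.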

>
> 2. orte_pls.terminate_job() now takes a list of attributes in
> addition to a job id. What are the attributes for, and what happens
> if I pass a NULL here? Do I need to create an empty attribute list?
>

You can always pass a NULL to any function looking for attributes - the
system knows how to handle that situation.

What you should pass here depends upon what you are trying to do. If you
just want to terminate a specific job, then you can just pass a NULL.
However, if you want to terminate the specified job AND any "children" that
were dynamically spawned by that job, then you need to pass the
ORTE_NS_INCLUDE_DESCENDANTS attribute - something like the following code
snippet (pulled from orterun) would work:

#include "opal/class/opal_list.h"

#include "orte/mca/pls/pls.h"
#include "orte/mca/rmgr/rmgr.h"
#include "orte/mca/ns/ns_types.h"
#include "orte/runtime/params.h"

    opal_list_t attrs;
    opal_list_item_t *item;
    int ret;

    /* build an attribute list asking that descendants be terminated too */
    OBJ_CONSTRUCT(&attrs, opal_list_t);
    orte_rmgr.add_attribute(&attrs, ORTE_NS_INCLUDE_DESCENDANTS, ORTE_UNDEF,
                            NULL, ORTE_RMGR_ATTR_OVERRIDE);

    /* jobid is the job to terminate; orte_abort_timeout is the default */
    ret = orte_pls.terminate_job(jobid, &orte_abort_timeout, &attrs);

    /* release the list items, then the list itself */
    while (NULL != (item = opal_list_remove_first(&attrs))) {
        OBJ_RELEASE(item);
    }
    OBJ_DESTRUCT(&attrs);

Please note that the orte_pls.terminate_job API in 1.2 will undergo a change
in the next few days (it has already changed in the trunk). The change,
included in the code snippet above, adds a timeout capability to have the
function "give up" if the job doesn't terminate within the specified time.
The parameter given above references the orte-wide default value (adjustable
via MCA param), but you can give it anything you like - passing a NULL for
the timeout param means never time out, so we'll keep trying until you order
us to quit.
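
To make the timeout semantics concrete, here is a minimal self-contained
sketch. It is not the ORTE implementation: mock_job_t, mock_job_terminated
and wait_for_termination are hypothetical names, the use of struct timeval
for the timeout is an assumption, and time is simulated (each poll is
charged a fixed 100 ms) so the behavior is deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical mock job: reports "terminated" on its polls_needed-th poll. */
typedef struct {
    int polls_needed;
    int polls_done;
} mock_job_t;

static int mock_job_terminated(mock_job_t *job)
{
    job->polls_done++;
    return job->polls_done >= job->polls_needed;
}

/* Returns 0 once the job has terminated, or -1 if we gave up.
 * A NULL timeout means "never give up". Elapsed time is simulated:
 * every unsuccessful poll costs poll_usec microseconds. */
static int wait_for_termination(mock_job_t *job, const struct timeval *timeout)
{
    const long long poll_usec = 100000; /* simulated 100 ms per poll */
    long long elapsed = 0;
    long long limit = 0;

    assert(job != NULL);
    if (NULL != timeout) {
        limit = (long long)timeout->tv_sec * 1000000LL + timeout->tv_usec;
    }

    for (;;) {
        if (mock_job_terminated(job)) {
            return 0;
        }
        elapsed += poll_usec;
        if (NULL != timeout && elapsed >= limit) {
            return -1; /* timed out: give up */
        }
    }
}
```

With a one-second timeout, a job that needs 50 polls is abandoned after 10
of them, while the same job with a NULL timeout is waited on to completion.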

> Greg
>
>
> On Jan 27, 2007, at 6:51 AM, Ralph Castain wrote:
>
>>
>>
>>
>> On 1/26/07 11:36 PM, "Greg Watson" <gwatson_at_[hidden]> wrote:
>>
>>> I have been using this define to implement the orte_stage_gate_init()
>>> functionality in PTP using OpenMPI 1.2b1 for some months now. As of
>>> 1.2b3 it appears that this define has been removed. New users
>>> attempting to build PTP against the latest 1.2b3 build are
>>> complaining that they are getting build errors.
>>>
>>> Please let me know what has replaced this define in 1.2b3, and how we
>>> can obtain the same functionality that we had in 1.2b1.
>>
>> You need to use ORTE_PROC_MY_HNP - no API change is involved, it is just a
>> #define. You may need to add #include "orte/mca/ns/ns_types.h" to pick it
>> up.
>>
>> You will also find that ORTE_RML_NAME_ANY is likewise gone - you need to use
>> ORTE_NAME_WILDCARD instead for the same reasons as described below.
>> Similarly, ORTE_RML_NAME_SELF has been replaced by ORTE_PROC_MY_NAME.
>>
>> We discovered during the testing/debugging of 1.2 that people had
>> unintentionally created multiple definitions for several critical names in
>> the system. Hence, we had an ORTE_RML_NAME_SEED, an ORTE_OOB_SEED, and
>> several others. In the event that definition had to change, we found the
>> code "cracking" all over the place - it was literally impossible to
>> maintain.
>>
>> So we bit the bullet and cleaned it up. No API changes were involved, but we
>> did remove duplicative defines (and their associated storage locations).
>> Hopefully, people will take the time to look up and use these system-level
>> defines instead of re-creating the problem!
>>
>>>
>>> Also, I would like to know what the policy of changing interfaces is,
>>> and when in the release cycle you freeze API changes. It's going to
>>> be extremely difficult to release a version of PTP built against
>>> OpenMPI if you change interfaces between beta versions.
>>
>> In my opinion, that is what "beta" is for - it isn't a "lock-down" release,
>> but rather a time to find your cracks and fix them. That said, we don't
>> change APIs for no reason, but only because we either (a) needed to do so to
>> add some requested functionality (e.g., the recent request for "pernode"
>> launch capabilities), or (b) found a bug in the system that required some
>> change or added functionality to fix (e.g., the recent changes in the PLS
>> behavior and API to support ctrl-c interrupt capabilities).
>>
>> I generally try to send emails out alerting people to these changes when
>> they occur (in fact, I'm pretty certain I sent one out on this issue).
>> However, looking back, I find that I send them to the OMPI "core developers"
>> list - not the "developers" one. I note that the OMPI layer developers tend
>> to do the same thing. I'll try to rectify that in the future and suggest my
>> OMPI compatriots do so too.
>>
>> Once an actual release is made, we only make an API change if a major bug is
>> found and an API change simply must be done to fix it. I don't recall such
>> an instance, though I think it may have happened once between minor release
>> numbers in the 1.1 family (not sure).
>>
>>
>>>
>>> Greg
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>