Open MPI Development Mailing List Archives

From: David Erukhimovich (daviderukh_at_[hidden])
Date: 2007-10-21 06:24:41

Hi Ralph,

I'm sorry to bother you again, but adding the new component to rds still
doesn't work as expected.
I've created a new component, rds_mosix. It is identical to rds_hostfile
(with all parameter names changed) except:

in rds/mosix/rds_mosix_component.c/orte_rds_mosix_open:
   mca_base_param_reg_string("rds_hostfile", "path",
                              "ORTE Host filename",
                              false, false, path,

in rds/mosix/rds_mosix.c/orte_rds_mosix_query:
  rc = mca_base_param_find("rds", "hostfile", "path");
  mca_base_param_lookup_string(rc, &mca_rds_mosix_component.path);
  printf("got hostfile: %s\n", mca_rds_mosix_component.path);

So I'm running:
   mpirun --mca rmaps round_robin --mca rds mosix --hostfile $MOSHOME/4hosts -np 2 hostname

and getting the output: "got hostfile: <default_hostfile_path>"
and not the given path.

What am I doing wrong?

Thank you

---------- Forwarded message ----------
From: Ralph Castain <rhc_at_[hidden]>
Date: Oct 20, 2007 6:52 PM
Subject: Re: [OMPI devel] Trying to get total procs num in odls framework
To: David Erukhimovich <daviderukh_at_[hidden]>

On 10/20/07 10:10 AM, "David Erukhimovich" <daviderukh_at_[hidden]> wrote:

> Hi Ralph,
> 2. I do want the user to be able to switch between my way of process
> launching and the default way. I can do it using an mca flag, but I would
> prefer a new component. If it is not too difficult for you, please make the
> patch; if it is, I'll just use an mca flag.

I can make it next week - shouldn't be too big a deal. I'll let you know if

> 1. Just remembered another difficulty I had: I've created a new rds
> component identical to the hostfile one. Let's call it mosix. Now, orterun
> saves the hostfile path in the mca parameter rds_hostfile_path or
> something like that. When I try to retrieve rds_hostfile_path or
> rds_mosix_path in the rds_mosix component I always get the default hostfile
> (it doesn't matter whether I gave a hostfile or not). And I tried everything:
> changing names in rds_mosix_component, declaring a new parameter
> rds_mosix_path in various places, etc. So now I'm just altering the
> hostfile component.
> Do you have any suggestions for how to make it work?

How are you retrieving the path? Here is the code from hostfile:

                              "ORTE Host filename",
                              false, false, path,

If you look at that, it is actually looking for an mca param of
"rds_hostfile_path". If you just copied this code, though, using your
component's name, then you would be looking for the mca param
"rds_<your-components-name>_path". What you probably need to do is hardwire
it to:

    mca_base_param_reg_string("rds_hostfile", "path",
                              "ORTE Host filename",
                              false, false, path,

Also, you may be encountering a problem in that the rds_hostfile component
is going to try and run as well as your component, and thus may overwrite
what you do. You might want to try -mca rds my_component to ensure that only
your component gets executed.
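
To make the naming point concrete, here is a minimal sketch of how a parameter name of the form "rds_<component>_path" is composed. It is a toy model, not Open MPI code: the function mca_full_name is hypothetical and only illustrates why a component named mosix ends up asking for a different parameter than the one called rds_hostfile_path.

```c
#include <stdio.h>
#include <string.h>

/* Toy model only (hypothetical helper, not part of Open MPI):
 * compose "<framework>_<component>_<param>" into buf.
 * Returns 0 on success, -1 if the name does not fit in buf. */
int mca_full_name(char *buf, size_t len, const char *framework,
                  const char *component, const char *param)
{
    int n = snprintf(buf, len, "%s_%s_%s", framework, component, param);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With ("rds", "hostfile", "path") this yields rds_hostfile_path, while ("rds", "mosix", "path") yields rds_mosix_path. A value stored under the first name is therefore not visible under the second.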

> Sorry for all the questions and thank you very much for the quick answers

Not a problem - hope this helps.


> Regards
> --David
> ---------- Forwarded message ----------
> From: Ralph Castain <rhc_at_[hidden]>
> Date: Oct 20, 2007 5:12 PM
> Subject: Re: [OMPI devel] Trying to get total procs num in odls framework
> To: David Erukhimovich <daviderukh_at_[hidden]>
> Hi David
> Thanks for the info - see comments below.
> Ralph
> On 10/20/07 6:58 AM, "David Erukhimovich" <daviderukh_at_[hidden]>
>> Hi
>> Thank you for your answer.
>> First of all, my two questions weren't connected; they belong to different
>> parts of my project. The subject of the mail should have been: Trying to
>> get total procs num in rds framework (sorry, my mistake).
>> Here are the parts in the order of the last email:
>> 1. I've solved the problem about getting total num of procs in rds (just
>> called some function incorrectly), so sorry for disturbing you about that.
>> Now a bit more about what I'm trying to do; maybe there is a better way than
>> mine:
>> I have a tool (external application) that, given a list of machines and a
>> number n, chooses the n best ones from the list (least loaded ones), and
>> if the list of machines isn't given, it just returns the n best machines
>> from the cluster. I wish to include this in ompi. Hence, given a
>> machinefile, it'll run the process only on the best nodes. If a machinefile
>> isn't given, it'll take the best node that my application returns.
>> I think the best place to implement it is in rds, after building the list
>> of newly discovered nodes: if it is empty, fill it using my tool; otherwise
>> filter it using my tool. It seems to me the most logical way to do it. Am I
>> right? I am asking you because I guess you have a better knowledge of the
>> architecture.
> It sounds like the correct place to me. At some point in the future, you
> could migrate that logic to the RAS instead, but I would just continue as
> you are doing for now.
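
The filtering step described above (keep only the n least-loaded machines from a given list) can be sketched as follows. This is only an illustration under stated assumptions: the node_t record, its load field, and keep_n_best are hypothetical names, and the load values would come from the external tool, which is not modelled here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node record; field names are illustrative only. */
typedef struct {
    char   name[64];
    double load;      /* lower means less loaded, i.e. "better" */
} node_t;

/* qsort comparator: ascending by load. */
static int by_load(const void *a, const void *b)
{
    const node_t *x = a, *y = b;
    return (x->load > y->load) - (x->load < y->load);
}

/* Sort nodes in place by ascending load and return how many entries
 * to keep from the front of the array (n, or count if n is larger). */
size_t keep_n_best(node_t *nodes, size_t count, size_t n)
{
    qsort(nodes, count, sizeof *nodes, by_load);
    return n < count ? n : count;
}
```

After the call, the first returned-count entries of the array are the least-loaded nodes. The "list is empty, so ask the tool for nodes" branch is left out because it depends entirely on the external application.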
>> 2. The other thing I am trying to do is to make ompi run every program
>> not directly, but through an external program. E.g., if I want to launch the
>> program "hostname", I want the following to be launched: "<my-program>
>> <my-program's-flags> hostname".
>> I figured that the best way to do it is in the odls framework because there I
>> have the exact execution point.
> I guess I wouldn't do it that way if I were doing a project of my own. I
> would just go into the default odls module and hardcode the revised
> I can't see this coming back into the production system, so unless you have
> some reason to want to run both with and without your revision, why go
> through the pain?
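
The launch form "<my-program> <my-program's-flags> hostname" discussed above amounts to prefixing the application's argument vector with the wrapper and its flags. A minimal sketch, assuming a hypothetical helper prepend_wrapper (this is not an odls function, and it does not show where in odls the call would go):

```c
#include <stdlib.h>

/* Hypothetical helper: build a NULL-terminated argv of the form
 *   wrapper, flags..., app_argv...
 * Only the pointers are copied, not the strings. The caller frees the
 * returned array with free(). Returns NULL if allocation fails. */
char **prepend_wrapper(const char *wrapper, char *const flags[],
                       char *const app_argv[])
{
    size_t nflags = 0, napp = 0, i, k = 0;
    char **out;

    while (flags && flags[nflags]) nflags++;      /* count wrapper flags */
    while (app_argv && app_argv[napp]) napp++;    /* count app arguments */

    out = malloc((1 + nflags + napp + 1) * sizeof *out);
    if (!out) return NULL;

    out[k++] = (char *)wrapper;
    for (i = 0; i < nflags; i++) out[k++] = flags[i];
    for (i = 0; i < napp; i++)   out[k++] = app_argv[i];
    out[k] = NULL;                                /* argv terminator */
    return out;
}
```

The resulting array has the shape expected by POSIX execvp, with the wrapper as argv[0] and the original program name following the wrapper's flags.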
>> I am currently working on the checkpoint 1.2.3. I don't work on the trunk
>> because I need the patches to be added on some stable release. Is there a
>> 1.2.* release where the bug is fixed? And if not, when can such a fixed
>> version be stable?
> I don't think there are any plans to backport that fix, though I imagine it
> could be done. If not, I could try and create a patch for you next week,
> though I would again suggest you just hardcode your change into the
> odls default component to make your life easier.
> Ralph
>> Thank you
>> --Davis
>> ---------- Forwarded message ----------
>> From: Ralph Castain <rhc_at_[hidden]>
>> Date: Oct 17, 2007 11:22 PM
>> Subject: Re: [OMPI devel] Trying to get total procs num in odls framework
>> To: daviderukh_at_[hidden]
>> Cc: "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
>> Hi David
>> I could probably answer your questions better if I had a better
>> understanding of what you are trying to do. For example, looking in the
>> hostfile rds for the number of procs to be launched seems strange as the
>> functional role of the framework is to simply learn what nodes are
>> available.
>> It would also help to have some idea of what environment you are working in,
>> and how you configured the beast.
>> Please see comments below.
>> Ralph
>> On 10/17/07 2:47 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>> Yo Ralph --
>>> Can you answer these questions?
>>> Begin forwarded message:
>>>> From: David Erukhimovich <daviderukh_at_[hidden]>
>>>> Date: October 14, 2007 5:08:45 PM EDT
>>>> To: devel_at_[hidden]
>>>> Subject: [OMPI devel] Trying to get total procs num in odls framework
>>>> Reply-To: Open MPI Developers <devel_at_[hidden]>
>>>> Hello,
>>>> I have 2 questions:
>>>> 1. I am trying to get the total number of requested processes for the job
>>>> in the 'hostfile' component in rds. I took the job object that was given
>>>> as a parameter, extracted the application objects and checked how many
>>>> procs each application has. The result in every run was 0. As I
>>>> understand, this variable is updated before the rds part. So what am I
>>>> doing wrong?
>> Do you mean you took the jobid given to the hostfile RDS (which isn't an
>> object, but just a number) and did an orte_rmgr.get_app_context to get
>> array of app_contexts? Is there some reason why you would want to do that
>> there?
>> Depending upon what the command line looks like, it is possible for the
>> number of procs to be zero - we allow that option and then fill in the
>> number later. If it was specified, though, we do insert the number in the
>> app_context object.
>> Maybe you could tell me what the command line looks like, the function call
>> you used to get the "application objects", and what field you were looking
>> at when you found zero?
>>>> 2. I've discovered an undocumented framework - odls.
>> It wasn't exactly hidden...we haven't documented it because we are lazy and
>> the existing components cover every known environment (or so we thought).
>> ;-)
>> Is there some special reason to want to create another one?
>>>> I've created a new
>>>> component for it. The problem is that there is no way to switch between
>>>> the default component and mine (--mca odls <my component> doesn't work).
>>>> Is there a way to switch between odls components (I saw bprocs there and
>>>> I guess it is used)?
>> Are you working on the trunk? What r level?
>> Reason I ask: I recently fixed a problem where the command line mca params
>> were not getting passed to the orteds. Your description looks like you
>> haven't picked up that change. If you have updated recently, and you still
>> can't get it to work, then we likely have a lingering problem.
>> If I read your subject line correctly, then I am somewhat puzzled. You could
>> look at the orte/mca/odls/base/odls_base_default_fns.c file, the
>> orte_odls_base_default_get_add_procs_data function and see where we get the
>> total number of procs in a job and how that is passed to the orteds. If you
>> have some new environment that the existing odls components can't handle,
>> then I would strongly suggest you at least use the default functions in the
>> base to provide as much support as possible as this will help you to keep
>> pace with changes in the system.
>> I would also welcome feedback on what you encountered that required a new
>> odls component - perhaps we can modify the base support functions to make it
>> fit within one of the existing components.
>> Thanks
>> Ralph
>>>> Thank you,
>>>> --David