Open MPI Development Mailing List Archives

From: David Erukhimovich (daviderukh_at_[hidden])
Date: 2007-10-29 19:24:48


Hi,
I was just reviewing my files in order to send them to Jeff, and fixed the
problem!!
I should've written:
        mca_base_param_string_name("rds_hostfile", "path" . . . );
instead of:
        mca_base_param_string("rds_hostfile", "path" . . .);
in the 'open' function of the component file.

But I don't understand how it compiled. There is no function
mca_base_param_string that takes a string as its first param (I know it
doesn't compile in the module file).
I compile using 'make all install' in the openmpi dir.

Thanks
--David
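
[Illustration] The fix above swaps a component-based registration call for a by-name one. The sketch below is a toy model of a global parameter table, not the Open MPI API: toy_set, toy_reg_string, toy_reg_string_name, toy_component_t and the path "/tmp/my_hosts" are all hypothetical names made up for this example. It assumes that a component-based call builds the parameter name from the calling component's own framework and component names, while a by-name call uses the names it is given. Under that assumption, a "mosix" component that registers "path" through its own descriptor ends up with rds_mosix_path and sees only the default, which would match the symptom described in the thread. On the compile question: C compilers of that era typically reported a string passed where a struct pointer was expected as an "incompatible pointer type" warning rather than an error, which may be why the build went through.

```c
#include <stdio.h>
#include <string.h>

/* Toy stand-in for a global parameter table. NOT the Open MPI API. */
#define MAX_PARAMS 16
#define MAX_NAME   96

typedef struct {
    char full_name[MAX_NAME];    /* e.g. "rds_hostfile_path" */
    const char *value;
} toy_param_t;

/* Hypothetical, heavily simplified component descriptor. */
typedef struct {
    const char *type_name;       /* framework, e.g. "rds" */
    const char *component_name;  /* component, e.g. "mosix" */
} toy_component_t;

static toy_param_t table[MAX_PARAMS];
static int num_params = 0;

/* Return the index of full_name in the table, or -1 if absent. */
static int find_param(const char *full_name)
{
    for (int i = 0; i < num_params; i++) {
        if (strcmp(table[i].full_name, full_name) == 0) {
            return i;
        }
    }
    return -1;
}

/* Set a parameter by its full name, creating it if needed
 * (models a launcher storing the hostfile path). */
void toy_set(const char *full_name, const char *value)
{
    int i = find_param(full_name);
    if (i < 0) {
        if (num_params == MAX_PARAMS) {
            return;              /* table full: silently ignore in this toy */
        }
        i = num_params++;
        snprintf(table[i].full_name, MAX_NAME, "%s", full_name);
    }
    table[i].value = value;
}

/* Return the current value, registering the default if the name is new. */
static const char *lookup_or_register(const char *full_name,
                                      const char *default_value)
{
    int i = find_param(full_name);
    if (i >= 0) {
        return table[i].value;
    }
    toy_set(full_name, default_value);
    return default_value;
}

/* By-name lookup: the name is "<type>_<param>", whoever the caller is. */
const char *toy_reg_string_name(const char *type, const char *param,
                                const char *default_value)
{
    char name[MAX_NAME];
    snprintf(name, sizeof(name), "%s_%s", type, param);
    return lookup_or_register(name, default_value);
}

/* Component-based lookup: the name is derived from the caller's descriptor. */
const char *toy_reg_string(const toy_component_t *c, const char *param,
                           const char *default_value)
{
    char name[MAX_NAME];
    snprintf(name, sizeof(name), "%s_%s_%s",
             c->type_name, c->component_name, param);
    return lookup_or_register(name, default_value);
}

/* Demo: a launcher stores the path, then a "mosix" component reads it back
 * both ways. */
void toy_demo(void)
{
    toy_component_t mosix = { "rds", "mosix" };
    toy_set("rds_hostfile_path", "/tmp/my_hosts");
    printf("by component: %s\n",
           toy_reg_string(&mosix, "path", "default_hosts"));
    printf("by name:      %s\n",
           toy_reg_string_name("rds_hostfile", "path", "default_hosts"));
}
```

Calling toy_demo() from a main function shows the component-based lookup returning the default, while the by-name lookup with "rds_hostfile" and "path" returns the stored path. This is in line with Jeff's remark below that the parameter space is global: any component can read the hostfile parameter as long as it asks for it under the right name.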

On Mon, 29 Oct 2007, Jeff Squyres wrote:

> Sorry guys, I did miss this earlier.
>
> I don't see a patch anywhere in the e-mail thread below -- can
> someone send me the problematic code in question?
>
> FWIW: The MCA param space is global, so there's no reason that a
> new/different RDS shouldn't be able to read the hostfile MCA parameter.
>
>
>
> On Oct 28, 2007, at 2:09 PM, Ralph Castain wrote:
>
> > Yo Jeff
> >
> > This may have slipped through your inbox (had OMPI devel in subject, so
> > may have been caught in some filter) - could you please provide any
> > thoughts on why the hostfile isn't getting picked up correctly? As I
> > indicated on the prior note, I verified that it is working for the
> > default hostfile component - I can't see anything wrong in David's call
> > to cause the problem. Please refer to the prior note for that code.
> >
> > Thanks
> > Ralph
> >
> >
> >
> > On 10/28/07 10:31 AM, "David Erukhimovich"
> > <daviderukh_at_[hidden]> wrote:
> >
> >> Thank you very much for the patch, it helped me a lot (It works!) and
> >> I really appreciate this.
> >>
> >> p.s. Any idea about the rds thing?
> >>
> >> Regards
> >> --David
> >>
> >>
> >> Ralph H Castain wrote:
> >>> Hi David
> >>>
> >>> Here is the promised patch - it passes params just fine, but I cannot
> >>> vouch for any unintended consequences. I -think- it will be fine, but it
> >>> lacks all the usual testing for a patch to an official release.
> >>>
> >>> Hope it helps
> >>> Ralph
> >>>
> >>>
> >>>
> >>> On 10/20/07 10:10 AM, "David Erukhimovich"
> >>> <daviderukh_at_[hidden]> wrote:
> >>>
> >>>>
> >>>> Hi Ralph,
> >>>>
> >>>> 2. I do want the user to be able to switch between my way of process
> >>>> launching and the default way. I can do it using an mca flag, but I
> >>>> would prefer a new component. If it is not too difficult for you,
> >>>> please make the patch; if it is, I'll just use an mca flag.
> >>>>
> >>>> 1. Just remembered another difficulty I had: I've created a new rds
> >>>> component identical to the hostfile one. Let's call it mosix. Now,
> >>>> orterun is saving the hostfile path in the mca parameter
> >>>> rds_hostfile_path, or something like that. When I try to retrieve
> >>>> rds_hostfile_path or rds_mosix_path in the rds_mosix component, I always
> >>>> get the default hostfile path (it doesn't matter whether I gave a
> >>>> hostfile or not). And I tried everything: changing names in
> >>>> rds_mosix_component, declaring a new parameter rds_mosix_path in various
> >>>> places, etc. So now I'm just altering the existing hostfile component.
> >>>> Do you have any suggestions how to make it work?
> >>>>
> >>>> Sorry for all the questions, and thank you very much for the quick
> >>>> answers.
> >>>>
> >>>> Regards
> >>>> --David
> >>>>
> >>>> ---------- Forwarded message ----------
> >>>> From: Ralph Castain <rhc_at_[hidden]>
> >>>> Date: Oct 20, 2007 5:12 PM
> >>>> Subject: Re: [OMPI devel] Trying to get total procs num in odls
> >>>> framework
> >>>> To: David Erukhimovich <daviderukh_at_[hidden]>
> >>>>
> >>>> Hi David
> >>>>
> >>>> Thanks for the info - see comments below.
> >>>>
> >>>> Ralph
> >>>>
> >>>>
> >>>> On 10/20/07 6:58 AM, "David Erukhimovich"
> >>>> <daviderukh_at_[hidden]> wrote:
> >>>>
> >>>>> Hi
> >>>>> Thank you for your answer.
> >>>>>
> >>>>> First of all, my two questions weren't connected, and they belong to
> >>>>> different parts of my project. And the subject of the mail should have
> >>>>> been: Trying to get total procs num in rds framework (sorry, my mistake).
> >>>>>
> >>>>> Here are the parts, in the order of the last email:
> >>>>>
> >>>>> 1. I've solved the problem about getting the total num of procs in
> >>>>> rds (I just called some function incorrectly), so sorry for disturbing
> >>>>> you about that.
> >>>>> Now a bit more about what I'm trying to do; maybe there is a
> >>>>> better way than mine:
> >>>>> I have a tool (an external application) that, given a list of
> >>>>> machines and a number n, chooses the n best ones from the list (the
> >>>>> least loaded ones), and if the list of machines isn't given, it just
> >>>>> returns the n best machines from the cluster. I wish to include this
> >>>>> in ompi. Hence, given a machinefile, it'll run the process only on
> >>>>> the best nodes. If a machinefile isn't given, it'll take the best
> >>>>> node that my application returns.
> >>>>> I think the best place to implement it is in rds, after building the
> >>>>> list of newly discovered nodes: if it is empty, fill it using my tool;
> >>>>> otherwise, filter it using my tool. It seems to me the most logical way
> >>>>> to do it. Am I right? I am asking you because I guess you have a better
> >>>>> knowledge of the ompi architecture.
> >>>> It sounds like the correct place to me. At some point in the
> >>>> future, you
> >>>> could migrate that logic to the RAS instead, but I would just
> >>>> continue as
> >>>> you are doing for now.
> >>>>
> >>>>> 2. The other thing I am trying to do is to make ompi run every
> >>>>> process not directly, but through an external program. E.g., if I want
> >>>>> to launch the program "hostname", I want the following to be launched:
> >>>>> "<my-program> <my-program's-flags> hostname".
> >>>>> I figured that the best way to do it is in the odls framework,
> >>>>> because there I have the exact executing point.
> >>>> I guess I wouldn't do it that way if I were doing a project of
> >>>> my own. I
> >>>> would just go into the default odls module and hardcode the
> >>>> revised launch.
> >>>> I can't see this coming back into the production system, so
> >>>> unless you have
> >>>> some reason to want to run both with and without your revision,
> >>>> why go
> >>>> through the pain?
> >>>>
> >>>>> I am currently working on the checkpoint 1.2.3. I don't work on the
> >>>>> trunk because I need the patches to be added on some stable release.
> >>>>> Is there a 1.2.* release where the bug is fixed? And if not, when can
> >>>>> such a fixed version be stable?
> >>>> I don't think there are any plans to backport that fix, though I
> >>>> imagine it
> >>>> could be done. If not, I could try and create a patch for you
> >>>> next week,
> >>>> though I would again suggest you just hardcode your change into
> >>>> the existing
> >>>> odls default component to make your life easier.
> >>>>
> >>>> Ralph
> >>>>
> >>>>> Thank you
> >>>>> --David
> >>>>>
> >>>>> ---------- Forwarded message ----------
> >>>>> From: Ralph Castain <rhc_at_[hidden]>
> >>>>> Date: Oct 17, 2007 11:22 PM
> >>>>> Subject: Re: [OMPI devel] Trying to get total procs num in odls
> >>>>> framework
> >>>>> To: daviderukh_at_[hidden]
> >>>>> Cc: "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]>
> >>>>>
> >>>>> Hi David
> >>>>>
> >>>>> I could probably answer your questions better if I had a better
> >>>>> understanding of what you are trying to do. For example, looking in the
> >>>>> hostfile rds for the number of procs to be launched seems strange as the
> >>>>> functional role of the framework is to simply learn what nodes are
> >>>>> available.
> >>>>>
> >>>>> It would also help to have some idea of what environment you are working
> >>>>> in, and how you configured the beast.
> >>>>>
> >>>>> Please see comments below.
> >>>>> Ralph
> >>>>>
> >>>>>
> >>>>> On 10/17/07 2:47 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
> >>>>>
> >>>>>> Yo Ralph --
> >>>>>>
> >>>>>> Can you answer these questions?
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>>> From: David Erukhimovich <daviderukh_at_[hidden]>
> >>>>>>> Date: October 14, 2007 5:08:45 PM EDT
> >>>>>>> To: devel_at_[hidden]
> >>>>>>> Subject: [OMPI devel] Trying to get total procs num in odls
> >>>>>>> framework
> >>>>>>> Reply-To: Open MPI Developers <devel_at_[hidden]>
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>> I have 2 questions:
> >>>>>>> 1. I am trying to get the total number of requested processes for
> >>>>>>> the job in the 'hostfile' component in rds. I took the job object
> >>>>>>> that was given as a parameter, extracted the application objects and
> >>>>>>> checked how many procs each application has. The result in every run
> >>>>>>> was 0. As I understand, this variable is updated before the rds part.
> >>>>>>> So what am I doing wrong?
> >>>>> Do you mean you took the jobid given to the hostfile RDS (which
> >>>>> isn't an
> >>>>> object, but just a number) and did an orte_rmgr.get_app_context
> >>>>> to get the
> >>>>> array of app_contexts? Is there some reason why you would want
> >>>>> to do that
> >>>>> there?
> >>>>>
> >>>>> Depending upon what the command line looks like, it is possible
> >>>>> for the
> >>>>> number of procs to be zero - we allow that option and then fill
> >>>>> in the
> >>>>> number later. If it was specified, though, we do insert the
> >>>>> number in the
> >>>>> app_context object.
> >>>>>
> >>>>> Maybe you could tell me what the command line looks like, the function
> >>>>> call you used to get the "application objects", and what field you were
> >>>>> looking at when you found zero?
> >>>>>
> >>>>>>> 2. I've discovered an undocumented framework - odls.
> >>>>> It wasn't exactly hidden...we haven't documented it because we are lazy
> >>>>> and the existing components cover every known environment (or so we
> >>>>> thought). ;-)
> >>>>>
> >>>>> Is there some special reason to want to create another one?
> >>>>>
> >>>>>>> I've created a new component for it. The problem is that there is
> >>>>>>> no way to switch between the default component and mine
> >>>>>>> (--mca odls <my component> doesn't work).
> >>>>>>> Is there a way to switch between odls components (I saw bprocs
> >>>>>>> there and I guess it is used)?
> >>>>> Are you working on the trunk? What r level?
> >>>>>
> >>>>> Reason I ask: I recently fixed a problem where the command line
> >>>>> mca params
> >>>>> were not getting passed to the orteds. Your description looks
> >>>>> like you
> >>>>> haven't picked up that change. If you have updated recently,
> >>>>> and you still
> >>>>> can't get it to work, then we likely have a lingering problem.
> >>>>>
> >>>>>
> >>>>> If I read your subject line correctly, then I am somewhat puzzled. You
> >>>>> can look at the orte/mca/odls/base/odls_base_default_fns.c file, the
> >>>>> orte_odls_base_default_get_add_procs_data function and see where we get
> >>>>> the total number of procs in a job and how that is passed to the orteds.
> >>>>> If you have some new environment that the existing odls components can't
> >>>>> handle, then I would strongly suggest you at least use the default
> >>>>> functions in the base to provide as much support as possible as this
> >>>>> will help you to keep pace with changes in the system.
> >>>>>
> >>>>> I would also welcome feedback on what you encountered that required a
> >>>>> new odls component - perhaps we can modify the base support functions to
> >>>>> make it fit within one of the existing components.
> >>>>>
> >>>>> Thanks
> >>>>> Ralph
> >>>>>
> >>>>>
> >>>>>>> Thank you,
> >>>>>>> --David
> >>>>>>> _______________________________________________
> >>>>>>> devel mailing list
> >>>>>>> devel_at_[hidden]
> >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>
> >>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>