Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [RFC] mca_base_select()
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-05-06 17:58:47


Sorry about that. Looking back at the filem logic, it seems that I
returned success even if select failed (and just used the 'none'
passthrough component). I committed a patch in r18389 that fixes this
problem.

The commit also adds a warning that prints on the filem verbose stream,
so if a user unexpectedly hits something like this in the wild, we can
help them debug it.

Cheers,
Josh
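
The fallback described in this message (warn, then use the 'none'
passthrough component instead of failing) can be illustrated with a
minimal C sketch. Every name in it (sketch_filem_base_select, the
sketch_select_* helpers, the SKETCH_* codes) is hypothetical and is not
taken from r18389 or the ORTE sources, and the filem verbose stream is
approximated with stderr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_SUCCESS 0
#define SKETCH_ERROR  (-1)

/* Hypothetical stand-in for a filem module. */
typedef struct sketch_filem_module {
    const char *name;
} sketch_filem_module_t;

/* Passthrough module used when nothing else can be selected. */
const sketch_filem_module_t sketch_filem_none = { "none" };
/* Illustrative "real" module returned by a successful selection. */
const sketch_filem_module_t sketch_filem_demo = { "demo" };

/* Stand-in for a generic selection step that finds a component. */
int sketch_select_ok(const sketch_filem_module_t **module)
{
    *module = &sketch_filem_demo;
    return SKETCH_SUCCESS;
}

/* Stand-in for a generic selection step that finds nothing,
 * e.g. because no component in the framework was built. */
int sketch_select_fail(const sketch_filem_module_t **module)
{
    *module = NULL;
    return SKETCH_ERROR;
}

/*
 * Try the generic selection first. If it fails, print a warning when
 * verbosity is enabled, fall back to the "none" module, and still
 * report success so that initialization can continue.
 */
int sketch_filem_base_select(int (*generic_select)(const sketch_filem_module_t **),
                             int verbose,
                             const sketch_filem_module_t **selected)
{
    assert(NULL != generic_select && NULL != selected);

    if (SKETCH_SUCCESS == generic_select(selected) && NULL != *selected) {
        return SKETCH_SUCCESS;
    }
    if (verbose > 0) {
        fprintf(stderr,
                "filem:select: no component selected, using \"none\"\n");
    }
    *selected = &sketch_filem_none;
    return SKETCH_SUCCESS;
}
```

With sketch_select_fail as the selection step, the wrapper still returns
SKETCH_SUCCESS and leaves the caller with the "none" module, which is
the outcome Ralph's report below asks for.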

On May 6, 2008, at 2:56 PM, Ralph H Castain wrote:

> Hmmm....well, I hit a problem (of course!). I have mca-no-build on the filem
> framework on my Mac. If I just mpirun -n 3 ./hello, I get the following
> error:
>
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_filem_base_select failed
> --> Returned value Error (-1) instead of ORTE_SUCCESS
>
> --------------------------------------------------------------------------
>
> After looking at the source code for filem_select, I can run just fine if I
> specify -mca filem none on the cmd line. Otherwise, it looks like your
> select logic insists that at least one component must be built and
> selectable?
>
> Is that generally true, or is your filem framework the exception? I think
> this would not be a good general requirement - frankly, I don't think it is
> good for any framework to have such a requirement.
>
> Ralph
>
>
>
> On 5/6/08 12:09 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:
>
>> This has been committed in r18381
>>
>> Please let me know if you have any problems with this commit.
>>
>> Cheers,
>> Josh
>>
>> On May 5, 2008, at 10:41 AM, Josh Hursey wrote:
>>
>>> Awesome.
>>>
>>> The branch is updated to the latest trunk head. I encourage folks to
>>> check out this repository and make sure that it builds on their
>>> system. A normal build of the branch should be enough to find out if
>>> there are any cut-n-paste problems (though I tried to be careful,
>>> mistakes do happen).
>>>
>>> I haven't heard any problems so this is looking like it will come in
>>> tomorrow after the teleconf. I'll ask again there to see if there are
>>> any voices of concern.
>>>
>>> Cheers,
>>> Josh
>>>
>>> On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:
>>>
>>>> This all sounds good to me!
>>>>
>>>> On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:
>>>>
>>>>> What: Add mca_base_select() and adjust frameworks & components to use it.
>>>>> Why: Consolidation of code for general goodness.
>>>>> Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
>>>>> When: Code ready now. Documentation ready soon.
>>>>> Timeout: May 6, 2008 (After teleconf) [1 week]
>>>>>
>>>>> Discussion:
>>>>> -----------
>>>>> For a number of years a few developers have been talking about
>>>>> creating an MCA base component selection function. For various reasons
>>>>> this was never implemented. Recently I decided to give it a try.
>>>>>
>>>>> A base select function will allow Open MPI to provide completely
>>>>> consistent selection behavior for many of its frameworks (18 of 31 to
>>>>> be exact at the moment). The primary goal of this work is to improve
>>>>> code maintainability through code reuse. Other benefits also result,
>>>>> such as a slightly smaller memory footprint.
>>>>>
>>>>> The mca_base_select() function represents the most commonly used
>>>>> logic for component selection: select the one component with the
>>>>> highest priority and close all of the unselected components. This
>>>>> function can be found at the path below in the branch:
>>>>> opal/mca/base/mca_base_components_select.c
>>>>>
>>>>> To support this I had to formalize a query() function in the
>>>>> mca_base_component_t of the form:
>>>>> int mca_base_query_component_fn(mca_base_module_t **module, int *priority);
>>>>>
>>>>> This function is specified after the open and close component
>>>>> functions in this structure so as to allow compatibility with frameworks
>>>>> that do not use the base selection logic. Frameworks that do *not* use
>>>>> this function are *not* affected by this commit. However, every
>>>>> component in the frameworks that use the mca_base_select function must
>>>>> adjust its component query function to fit the form specified above.
>>>>>
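
The selection rule quoted above (query each component, keep the one with
the highest priority, close the rest) can be sketched in a few lines of
C. This is only an illustration under stated assumptions: the sketch_*
types and the demo_* components are hypothetical stand-ins, not the
mca_base_module_t / mca_base_component_t definitions or the code in
opal/mca/base/mca_base_components_select.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the MCA module and component types. */
typedef struct sketch_module {
    const char *name;
} sketch_module_t;

typedef struct sketch_component {
    const char *name;
    /* Same shape as the query() signature quoted above. */
    int (*query)(sketch_module_t **module, int *priority);
    /* Called on every component that is not selected; may be NULL. */
    void (*close)(void);
} sketch_component_t;

/*
 * Query every component, keep the one reporting the highest priority,
 * and close the rest. Returns 0 if a component was selected and -1 if
 * none was selectable.
 */
int sketch_select(sketch_component_t *components, size_t count,
                  sketch_component_t **best_component,
                  sketch_module_t **best_module)
{
    int best_priority = 0;

    assert(NULL != best_component && NULL != best_module);
    *best_component = NULL;
    *best_module = NULL;

    for (size_t i = 0; i < count; ++i) {
        sketch_module_t *module = NULL;
        int priority = 0;

        /* Skip components without a query function or that decline. */
        if (NULL == components[i].query ||
            0 != components[i].query(&module, &priority) ||
            NULL == module) {
            continue;
        }
        if (NULL == *best_component || priority > best_priority) {
            *best_component = &components[i];
            *best_module = module;
            best_priority = priority;
        }
    }

    /* Close everything that did not win the selection. */
    for (size_t i = 0; i < count; ++i) {
        if (&components[i] != *best_component && NULL != components[i].close) {
            components[i].close();
        }
    }
    return (NULL == *best_component) ? -1 : 0;
}

/* Demo components, purely illustrative. */
static sketch_module_t demo_low_module  = { "low" };
static sketch_module_t demo_high_module = { "high" };
int demo_closed = 0;  /* counts how many components were closed */

int demo_query_low(sketch_module_t **module, int *priority)
{
    *module = &demo_low_module;
    *priority = 10;
    return 0;
}

int demo_query_high(sketch_module_t **module, int *priority)
{
    *module = &demo_high_module;
    *priority = 50;
    return 0;
}

int demo_query_decline(sketch_module_t **module, int *priority)
{
    *module = NULL;
    *priority = -1;
    return -1;
}

void demo_close(void)
{
    ++demo_closed;
}
```

Given the three demo components, sketch_select() picks the one backed by
demo_query_high and calls demo_close() for the other two. With an empty
component list it returns -1, which corresponds to the "nothing built or
selectable" case discussed earlier in this thread.
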
>>>>> 18 frameworks in Open MPI have been changed. I have updated all of the
>>>>> components in the 18 frameworks available in the trunk on my branch.
>>>>> The affected frameworks are:
>>>>> - OPAL carto
>>>>> - OPAL crs
>>>>> - OPAL maffinity
>>>>> - OPAL memchecker
>>>>> - OPAL paffinity
>>>>> - ORTE errmgr
>>>>> - ORTE ess
>>>>> - ORTE filem
>>>>> - ORTE grpcomm
>>>>> - ORTE odls
>>>>> - ORTE plm
>>>>> - ORTE ras
>>>>> - ORTE rmaps
>>>>> - ORTE routed
>>>>> - ORTE snapc
>>>>> - OMPI crcp
>>>>> - OMPI dpm
>>>>> - OMPI pubsub
>>>>>
>>>>> There was a question of the memory footprint change as a result of
>>>>> this commit. I used 'pmap' to determine process memory footprint of a
>>>>> hello world MPI program. Static and shared build numbers are below
>>>>> along with variations on launching locally and to a single node
>>>>> allocated by SLURM. All of this was on Indiana University's Odin
>>>>> machine. We compare against the trunk (r18276) representing the last
>>>>> SVN sync point of the branch.
>>>>>
>>>>> Process (shared) | Trunk   | Branch  | Diff (improvement)
>>>>> -----------------+---------+---------+-------------------
>>>>> mpirun (orted)   |  39976K |  36828K | 3148K
>>>>> hello (0)        | 229288K | 229268K |   20K
>>>>> hello (1)        | 229288K | 229268K |   20K
>>>>> -----------------+---------+---------+-------------------
>>>>> mpirun           |  40032K |  37924K | 2108K
>>>>> orted            |  34720K |  34660K |   60K
>>>>> hello (0)        | 228404K | 228384K |   20K
>>>>> hello (1)        | 228404K | 228384K |   20K
>>>>>
>>>>> Process (static) | Trunk   | Branch  | Diff (improvement)
>>>>> -----------------+---------+---------+-------------------
>>>>> mpirun (orted)   |  21384K |  21372K |   12K
>>>>> hello (0)        | 194000K | 193980K |   20K
>>>>> hello (1)        | 194000K | 193980K |   20K
>>>>> -----------------+---------+---------+-------------------
>>>>> mpirun           |  21384K |  21372K |   12K
>>>>> orted            |  21208K |  21196K |   12K
>>>>> hello (0)        | 193116K | 193096K |   20K
>>>>> hello (1)        | 193116K | 193096K |   20K
>>>>>
>>>>> As you can see there are some small memory footprint improvements on
>>>>> my branch that result from this work. The size of the Open MPI project
>>>>> shrinks a bit as well. This commit cuts between 2,000 and 3,500 lines
>>>>> of code (depending on how you count), so about a 1% code shrink.
>>>>>
>>>>> The branch is stable in all of the testing I have done, but there are
>>>>> some platforms on which I cannot test. So please give this branch a
>>>>> try and let me know if you find any problems.
>>>>>
>>>>> Cheers,
>>>>> Josh
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems