Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [RFC] mca_base_select()
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-05-12 21:32:42


I -think- I may have found the problem here, but don't have a real test case
- try r18429 and see if it works.

On 5/11/08 4:32 PM, "Josh Hursey" <jjhursey_at_[hidden]> wrote:

> From the stacktrace, this doesn't look like a problem with
> base_select, but with 'orte_util_encode_pidmap'. You may want to
> start looking there.
>
> -- Josh
>
> On May 11, 2008, at 1:30 PM, Lenny Verkhovsky wrote:
>
>> Hi,
>> I tried r18423 with the rank_file component and got a segfault
>> (I increased the priority of the component when rmaps_rank_file_path exists).
>>
>>
>> /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun -np 4 -hostfile
>> hostfile_ompi -mca rmaps_rank_file_path rankfile -mca
>> paffinity_base_verbose 5 ./mpi_p_SMD -t bw -output 1 -order 1
>> [witch1:25456] mca:base:select: Querying component [linux]
>> [witch1:25456] mca:base:select: Query of component [linux] set
>> priority to 10
>> [witch1:25456] mca:base:select: Selected component [linux]
>> [witch1:25456] *** Process received signal ***
>> [witch1:25456] Signal: Segmentation fault (11)
>> [witch1:25456] Signal code: Invalid permissions (2)
>> [witch1:25456] Failing at address: 0x2b2875530030
>> [witch1:25456] [ 0] /lib64/libpthread.so.0 [0x2b28759dfc10]
>> [witch1:25456] [ 1] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> pal.so.0 [0x2b28753e2bb6]
>> [witch1:25456] [ 2] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> pal.so.0 [0x2b28753e23b6]
>> [witch1:25456] [ 3] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> pal.so.0 [0x2b28753e22fd]
>> [witch1:25456] [ 4] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> rte.so.0(orte_util_encode_pidmap+0x2f4) [0x2b287527f412]
>> [witch1:25456] [ 5] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> rte.so.0(orte_odls_base_default_get_add_procs_data+0x989)
>> [0x2b28752934f5]
>> [witch1:25456] [ 6] /home/USERS/lenny/OMPI_ORTE_SMD/lib/libopen-
>> rte.so.0(orte_plm_base_launch_apps+0x1a3) [0x2b287529e60b]
>> [witch1:25456] [ 7] /home/USERS/lenny/OMPI_ORTE_SMD/lib/openmpi/
>> mca_plm_rsh.so [0x2b287612f788]
>> [witch1:25456] [ 8] /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun
>> [0x4032bf]
>> [witch1:25456] [ 9] /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun
>> [0x402b53]
>> [witch1:25456] [10] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x2b2875b06154]
>> [witch1:25456] [11] /home/USERS/lenny/OMPI_ORTE_SMD/bin/mpirun
>> [0x402aa9]
>> [witch1:25456] *** End of error message ***
>> Segmentation fault
>>
>>
>>
>>
>> On Tue, May 6, 2008 at 9:09 PM, Josh Hursey <jjhursey_at_[hidden]>
>> wrote:
>> This has been committed in r18381
>>
>> Please let me know if you have any problems with this commit.
>>
>> Cheers,
>> Josh
>>
>> On May 5, 2008, at 10:41 AM, Josh Hursey wrote:
>>
>>> Awesome.
>>>
>>> The branch is updated to the latest trunk head. I encourage folks to
>>> check out this repository and make sure that it builds on their
>>> system. A normal build of the branch should be enough to find out if
>>> there are any cut-n-paste problems (though I tried to be careful,
>>> mistakes do happen).
>>>
>>> I haven't heard any problems so this is looking like it will come in
>>> tomorrow after the teleconf. I'll ask again there to see if there are
>>> any voices of concern.
>>>
>>> Cheers,
>>> Josh
>>>
>>> On May 5, 2008, at 9:58 AM, Jeff Squyres wrote:
>>>
>>>> This all sounds good to me!
>>>>
>>>> On Apr 29, 2008, at 6:35 PM, Josh Hursey wrote:
>>>>
>>>>> What: Add mca_base_select() and adjust frameworks & components to
>>>>> use
>>>>> it.
>>>>> Why: Consolidation of code for general goodness.
>>>>> Where: https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play
>>>>> When: Code ready now. Documentation ready soon.
>>>>> Timeout: May 6, 2008 (After teleconf) [1 week]
>>>>>
>>>>> Discussion:
>>>>> -----------
>>>>> For a number of years a few developers have been talking about
>>>>> creating a MCA base component selection function. For various
>>>>> reasons
>>>>> this was never implemented. Recently I decided to give it a try.
>>>>>
>>>>> A base select function will allow Open MPI to provide completely
>>>>> consistent selection behavior for many of its frameworks (18 of 31
>>>>> to be exact at the moment). The primary goal of this work is to
>>>>> improve code maintainability through code reuse. Other benefits also
>>>>> result, such as a slightly smaller memory footprint.
>>>>>
>>>>> The mca_base_select() function implements the most commonly used
>>>>> logic for component selection: select the one component with the
>>>>> highest priority and close all of the non-selected components. This
>>>>> function can be found at the path below in the branch:
>>>>> opal/mca/base/mca_base_components_select.c
>>>>>
>>>>> To support this I had to formalize a query() function in
>>>>> mca_base_component_t of the form:
>>>>> int mca_base_query_component_fn(mca_base_module_t **module,
>>>>> int *priority);
>>>>>
>>>>> This function is specified after the open and close component
>>>>> functions in this structure so as to allow compatibility with
>>>>> frameworks that do not use the base selection logic. Frameworks that
>>>>> do *not* use this function are *not* affected by this commit.
>>>>> However, every component in the frameworks that use the
>>>>> mca_base_select function must adjust its component query function to
>>>>> fit the form specified above.
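[Editor's note: to make the contract above concrete, here is a minimal, hypothetical sketch of a component query function matching the described signature. The type definition and all names (example_module, example_component_query) are invented stand-ins for illustration; real components return their framework's module type and OPAL_SUCCESS.]

```c
/* Hypothetical stand-in for the real Open MPI base module type,
 * shown only so this sketch is self-contained. */
typedef struct mca_base_module_t {
    int dummy;
} mca_base_module_t;

/* The module this component offers if it wins selection. */
static mca_base_module_t example_module;

/* A query function of the formalized shape: report this component's
 * module and priority. mca_base_select() then keeps the component
 * that reported the highest priority and closes all the others. */
static int example_component_query(mca_base_module_t **module, int *priority)
{
    *priority = 10;              /* this component's relative preference */
    *module   = &example_module; /* module to use if selected */
    return 0;                    /* OPAL_SUCCESS in the real code base */
}
```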
>>>>>
>>>>> 18 frameworks in Open MPI have been changed. I have updated all of
>>>>> the components in the 18 frameworks available in the trunk on my
>>>>> branch. The affected frameworks are:
>>>>> - OPAL Carto
>>>>> - OPAL crs
>>>>> - OPAL maffinity
>>>>> - OPAL memchecker
>>>>> - OPAL paffinity
>>>>> - ORTE errmgr
>>>>> - ORTE ess
>>>>> - ORTE Filem
>>>>> - ORTE grpcomm
>>>>> - ORTE odls
>>>>> - ORTE plm
>>>>> - ORTE ras
>>>>> - ORTE rmaps
>>>>> - ORTE routed
>>>>> - ORTE snapc
>>>>> - OMPI crcp
>>>>> - OMPI dpm
>>>>> - OMPI pubsub
>>>>>
>>>>> There was a question of the memory footprint change as a result of
>>>>> this commit. I used 'pmap' to determine process memory footprint
>>>>> of a
>>>>> hello world MPI program. Static and Shared build numbers are below
>>>>> along with variations on launching locally and to a single node
>>>>> allocated by SLURM. All of this was on Indiana University's Odin
>>>>> machine. We compare against the trunk (r18276), representing the
>>>>> last SVN sync point of the branch.
>>>>>
>>>>> Process(shared)| Trunk | Branch | Diff (Improvement)
>>>>> ---------------+----------+---------+-------
>>>>> mpirun (orted) | 39976K | 36828K | 3148K
>>>>> hello (0) | 229288K | 229268K | 20K
>>>>> hello (1) | 229288K | 229268K | 20K
>>>>> ---------------+----------+---------+-------
>>>>> mpirun | 40032K | 37924K | 2108K
>>>>> orted | 34720K | 34660K | 60K
>>>>> hello (0) | 228404K | 228384K | 20K
>>>>> hello (1) | 228404K | 228384K | 20K
>>>>>
>>>>> Process(static)| Trunk | Branch | Diff (Improvement)
>>>>> ---------------+----------+---------+-------
>>>>> mpirun (orted) | 21384K | 21372K | 12K
>>>>> hello (0) | 194000K | 193980K | 20K
>>>>> hello (1) | 194000K | 193980K | 20K
>>>>> ---------------+----------+---------+-------
>>>>> mpirun | 21384K | 21372K | 12K
>>>>> orted | 21208K | 21196K | 12K
>>>>> hello (0) | 193116K | 193096K | 20K
>>>>> hello (1) | 193116K | 193096K | 20K
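[Editor's note: the exact pmap invocation behind the tables above isn't given in the thread, but the arithmetic amounts to summing pmap's per-mapping size column for a process. A hedged sketch of that kind of pipeline, using inline sample data instead of a live PID so the numbers are reproducible:]

```shell
# Sum the size column (e.g. "512K") of pmap-style output into a total.
# With a live process one would pipe `pmap <pid>` instead of printf.
printf '0040000 512K r-x-- mpirun\n2b280000 1024K rw--- [ anon ]\n' |
  awk '{ sub(/K$/, "", $2); total += $2 } END { print total "K" }'
# prints: 1536K
```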
>>>>>
>>>>> As you can see, there are some small memory footprint improvements
>>>>> on my branch that result from this work. The size of the Open MPI
>>>>> project shrinks a bit as well: this commit cuts between 2,000 and
>>>>> 3,500 lines of code (depending on how you count), roughly a 1%
>>>>> code shrink.
>>>>>
>>>>> The branch is stable in all of the testing I have done, but there
>>>>> are some platforms on which I cannot test. So please give this
>>>>> branch a try and let me know if you find any problems.
>>>>>
>>>>> Cheers,
>>>>> Josh
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>>
>>
>