
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: Tim Mattox (timattox_at_[hidden])
Date: 2009-01-14 06:41:34


Unfortunately, although this fixed some problems when enabling hierarch coll,
there is still a segfault in two of IU's tests that only shows up when we set
-mca coll_hierarch_priority 100

This MTT summary shows how the failures improved on the trunk,
though two tests still segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved things,
but there is still more to be done to get coll_hierarch ready for regular
use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosilca_at_[hidden]> wrote:
> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
> george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>> Let's debate tomorrow when people are around, but first you have to file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>> Unfortunately, this pinpoints the fact that we didn't test the
>>> collective module mixing enough. I went over the tuned collective functions
>>> and changed all instances to use the correct module information. It is now
>>> on the trunk, revision 20267. Simultaneously, I checked that all other
>>> collective components do the right thing ... and I have to admit tuned was
>>> the only faulty one.
>>>
>>> This is clearly a bug in tuned, and correcting it will allow people
>>> to use hierarch. In its current incarnation, 1.3 will mostly/always
>>> segfault when hierarch is active. I would prefer not to put a broken toy
>>> out there. How about pushing r20267 into 1.3?
>>>
>>> george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
>>>> Thanks for digging into this. Can you file a bug? Let's mark it for
>>>> v1.3.1.
>>>>
>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>>>> since hierarch isn't currently selected by default (you must specifically
>>>> elevate hierarch's priority to get it to run), there's no danger that users
>>>> will run into this problem in default runs.
>>>>
>>>> But clearly the problem needs to be fixed, and therefore we need a bug
>>>> to track it.
>>>>
>>>>
>>>>
>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>
>>>>> I just debugged the Reduce_scatter bug mentioned previously. The bug is
>>>>> unfortunately not in hierarch, but in tuned.
>>>>>
>>>>> Here is the code snippet causing the problems:
>>>>>
>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>> {
>>>>> ...
>>>>> err = comm->c_coll.coll_reduce (...., module);
>>>>> ...
>>>>> }
>>>>>
>>>>>
>>>>> but should be
>>>>> {
>>>>> ...
>>>>> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>>>> ...
>>>>> }
>>>>>
>>>>> The problem right now is that when using hierarch, only a subset of
>>>>> the functions is set, e.g. reduce, allreduce, bcast and barrier.
>>>>> Thus, reduce_scatter comes from tuned in most scenarios, and calls the
>>>>> subsequent functions with the wrong module (its own, rather than the one
>>>>> registered for the function it calls). Hierarch of course does not like
>>>>> that :-)
>>>>>
>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that mostly
>>>>> correctly.
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>> --
>>>>> Edgar Gabriel
>>>>> Assistant Professor
>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>> Department of Computer Science University of Houston
>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>
>
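
For readers following the thread, the module mismatch Edgar describes in the
quoted message above can be sketched in isolation. This is only an
illustrative mock, not the Open MPI source: the types mock_module_t and
mock_comm_t, the mod_kind_t tag, and every function name below are
hypothetical stand-ins. Only the member names c_coll, coll_reduce and
coll_reduce_module are taken from the snippet in the thread. Where the real
code would presumably crash on the foreign module, the mock returns -1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Open MPI types; NOT the real definitions. */
typedef enum { MOD_TUNED, MOD_HIERARCH } mod_kind_t;

typedef struct mock_module {
    mod_kind_t kind;                 /* which component owns this module */
} mock_module_t;

struct mock_comm;
typedef int (*reduce_fn_t)(struct mock_comm *comm, mock_module_t *module);

typedef struct mock_comm {
    struct {
        reduce_fn_t    coll_reduce;         /* selected reduce function */
        mock_module_t *coll_reduce_module;  /* module registered with it */
    } c_coll;
} mock_comm_t;

/* Mock of hierarch's reduce: only valid when handed hierarch's own module.
 * Assumption: the real function reads module-private data, so a foreign
 * module would crash it. The mock reports -1 instead of crashing. */
int hierarch_reduce(mock_comm_t *comm, mock_module_t *module)
{
    (void)comm;
    return (module != NULL && module->kind == MOD_HIERARCH) ? 0 : -1;
}

/* Buggy pattern from the thread: forwards tuned's own module to whatever
 * reduce function happens to be installed on the communicator. */
int tuned_reduce_scatter_buggy(mock_comm_t *comm, mock_module_t *module)
{
    return comm->c_coll.coll_reduce(comm, module);
}

/* Corrected pattern: pass the module registered alongside coll_reduce. */
int tuned_reduce_scatter_fixed(mock_comm_t *comm, mock_module_t *module)
{
    (void)module;  /* tuned's own module is not what coll_reduce expects */
    return comm->c_coll.coll_reduce(comm, comm->c_coll.coll_reduce_module);
}

/* Mixed selection: reduce comes from hierarch, reduce_scatter from tuned. */
void run_demo(void)
{
    mock_module_t tuned    = { MOD_TUNED };
    mock_module_t hierarch = { MOD_HIERARCH };
    mock_comm_t   comm;

    comm.c_coll.coll_reduce        = hierarch_reduce;
    comm.c_coll.coll_reduce_module = &hierarch;

    assert(tuned_reduce_scatter_buggy(&comm, &tuned) == -1); /* wrong module */
    assert(tuned_reduce_scatter_fixed(&comm, &tuned) == 0);  /* right module */
}
```

Calling run_demo() from a main() exercises both paths. The mock also suggests
why the bug would stay hidden in default runs: if reduce and reduce_scatter
both come from the same component, the forwarded module happens to be the
right one.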

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox_at_[hidden] || timattox_at_[hidden]
    I'm a bright... http://www.the-brights.net/