Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-01-14 00:15:09


Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

   george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

> Let's debate tomorrow when people are around, but first you have to
> file a CMR... :-)
>
> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>
>> Unfortunately, this pinpoints the fact that we didn't test mixing
>> collective modules enough. I went over the tuned collective
>> functions and changed all instances to use the correct module
>> information. The fix is now on the trunk, revision 20267. At the
>> same time, I checked that all other collective components do the
>> right thing ... and I have to admit tuned was the only faulty
>> one.
>>
>> This is clearly a bug in tuned, and correcting it will allow
>> people to use hierarch. In its current incarnation, 1.3 will
>> mostly/always segfault when hierarch is active. I would prefer not
>> to give a broken toy out there. How about pushing r20267 into 1.3?
>>
>> george.
>>
>>
>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>
>>> Thanks for digging into this. Can you file a bug? Let's mark it
>>> for v1.3.1.
>>>
>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>>> and since hierarch isn't currently selected by default (you must
>>> specifically elevate hierarch's priority to get it to run),
>>> there's no danger that users will run into this problem in default
>>> runs.
>>>
>>> But clearly the problem needs to be fixed, and therefore we need a
>>> bug to track it.
>>>
>>>
>>>
>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>
>>>> I just debugged the Reduce_scatter bug mentioned previously. The
>>>> bug is unfortunately not in hierarch, but in tuned.
>>>>
>>>> Here is the code snippet causing the problems:
>>>>
>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>> {
>>>> ...
>>>> err = comm->c_coll.coll_reduce (...., module);
>>>> ...
>>>> }
>>>>
>>>>
>>>> but should be
>>>> {
>>>> ...
>>>> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>>> ...
>>>> }
>>>>
>>>> The problem right now is that when using hierarch, only a subset
>>>> of the functions is set, e.g. reduce, allreduce, bcast and
>>>> barrier. Thus, reduce_scatter comes from tuned in most
>>>> scenarios, and calls the subsequent functions with the wrong
>>>> module. Hierarch of course does not like that :-)
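
The module mismatch described above can be illustrated with a small self-contained mock. This is only a sketch and not Open MPI source: `module_t`, `coll_table_t`, `hierarch_reduce` and the two `tuned_reduce_scatter_*` functions are hypothetical stand-ins, and the `-1` return stands in for the crash that a real component could hit when it interprets another component's module as its own.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a collective module (not the real OMPI type). */
typedef struct module {
    const char *owner;   /* name of the component that created the module */
} module_t;

typedef int (*reduce_fn_t)(int value, module_t *module);

/* Hypothetical per-communicator table: each function pointer is stored
 * together with the module of the component that provides it. */
typedef struct coll_table {
    reduce_fn_t coll_reduce;
    module_t   *coll_reduce_module;
} coll_table_t;

/* Mock "hierarch" reduce: only works when it receives its own module. */
static int hierarch_reduce(int value, module_t *module)
{
    if (strcmp(module->owner, "hierarch") != 0) {
        return -1;   /* stand-in for misusing a foreign module */
    }
    return value;
}

/* Buggy variant: forwards its own (tuned) module to whichever reduce
 * happens to be installed on the communicator. */
static int tuned_reduce_scatter_buggy(coll_table_t *c, int value,
                                      module_t *module)
{
    return c->coll_reduce(value, module);
}

/* Fixed variant: passes the module registered alongside the function. */
static int tuned_reduce_scatter_fixed(coll_table_t *c, int value,
                                      module_t *module)
{
    (void)module;   /* tuned's own module is not what reduce needs */
    return c->coll_reduce(value, c->coll_reduce_module);
}

/* Exercises both variants on a table where hierarch provides reduce. */
static void demo(void)
{
    module_t tuned = { "tuned" };
    module_t hierarch = { "hierarch" };
    coll_table_t c = { hierarch_reduce, &hierarch };

    assert(tuned_reduce_scatter_buggy(&c, 42, &tuned) == -1);
    assert(tuned_reduce_scatter_fixed(&c, 42, &tuned) == 42);
}
```

Calling `demo()` from a `main` function runs both paths. In this mock, the buggy variant fails because the mock reduce is handed the tuned module, while the fixed variant succeeds because it looks up the module stored next to the function pointer.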
>>>>
>>>> Anyway, a quick glance through the tuned code reveals a
>>>> significant number of places where this pattern appears
>>>> (reduce_scatter, allreduce, allgather, allgatherv). Basic,
>>>> hierarch and inter seem to do this mostly correctly.
>>>>
>>>> Thanks
>>>> Edgar
>>>> --
>>>> Edgar Gabriel
>>>> Assistant Professor
>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>> Department of Computer Science University of Houston
>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>
>
> --
> Jeff Squyres
> Cisco Systems