Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.


Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-01-14 00:15:09


Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

   george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

> Let's debate tomorrow when people are around, but first you have to
> file a CMR... :-)
>
> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>
>> Unfortunately, this pinpoints the fact that we didn't test the
>> mixing of collective modules enough. I went over the tuned
>> collective functions and changed all instances to use the correct
>> module information. It is now on the trunk, revision 20267.
>> At the same time, I checked that all other collective components do
>> the right thing ... and I have to admit tuned was the only faulty
>> one.
>>
>> This is clearly a bug in tuned, and correcting it will allow
>> people to use hierarch. In its current incarnation, 1.3 will
>> mostly/always segfault when hierarch is active. I would prefer not
>> to put a broken toy out there. How about pushing r20267 into 1.3?
>>
>> george.
>>
>>
>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>
>>> Thanks for digging into this. Can you file a bug? Let's mark it
>>> for v1.3.1.
>>>
>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>>> and since hierarch isn't currently selected by default (you must
>>> specifically elevate hierarch's priority to get it to run),
>>> there's no danger that users will run into this problem in default
>>> runs.
>>>
>>> But clearly the problem needs to be fixed, and therefore we need a
>>> bug to track it.
>>>
>>>
>>>
>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>
>>>> I just debugged the Reduce_scatter bug mentioned previously. The
>>>> bug is unfortunately not in hierarch, but in tuned.
>>>>
>>>> Here is the code snippet causing the problems:
>>>>
>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>> {
>>>> ...
>>>> err = comm->c_coll.coll_reduce (...., module);
>>>> ...
>>>> }
>>>>
>>>>
>>>> but should be
>>>> {
>>>> ...
>>>> err = comm->c_coll.coll_reduce (...., comm->c_coll.coll_reduce_module);
>>>> ...
>>>> }
>>>>
>>>> The problem as it is right now is that, when using hierarch, only
>>>> a subset of the functions is set, e.g. reduce, allreduce, bcast
>>>> and barrier. Thus, reduce_scatter comes from tuned in most
>>>> scenarios, and calls the subsequent functions with the wrong
>>>> module. Hierarch of course does not like that :-)
>>>>
>>>> Anyway, a quick glance through the tuned code reveals a
>>>> significant number of instances where this
>>>> appears (reduce_scatter, allreduce, allgather, allgatherv). Basic,
>>>> hierarch and inter seem to do that mostly correctly.
>>>>
>>>> Thanks
>>>> Edgar
>>>> --
>>>> Edgar Gabriel
>>>> Assistant Professor
>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>> Department of Computer Science University of Houston
>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
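
Illustration of the module mix-up described in Edgar's message. This is a minimal, self-contained C sketch under stated assumptions, not the Open MPI source. Only the names c_coll, coll_reduce and coll_reduce_module come from the thread; the types module_t and comm_t, the owner tag, the coll_reduce_scatter fields, and the helper run_mixed are hypothetical stand-ins. In the sketch, hierarch's reduce merely returns an error when handed a foreign module, whereas the thread reports that the real code mostly segfaults.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: each module records which component owns it. */
enum owner { OWNER_TUNED, OWNER_HIERARCH };

typedef struct module { enum owner owner; } module_t;

struct comm;
typedef int (*coll_fn)(struct comm *comm, module_t *module);

/* Per-communicator table: one function pointer plus the module that
 * belongs to that function, mirroring the fields named in the thread. */
typedef struct comm {
    struct {
        coll_fn   coll_reduce;
        module_t *coll_reduce_module;
        coll_fn   coll_reduce_scatter;          /* assumed field name */
        module_t *coll_reduce_scatter_module;   /* assumed field name */
    } c_coll;
} comm_t;

/* hierarch's reduce only works with a hierarch module. Here a foreign
 * module yields -1 instead of a crash. */
static int hierarch_reduce(comm_t *comm, module_t *module)
{
    (void)comm;
    return (module->owner == OWNER_HIERARCH) ? 0 : -1;
}

/* Buggy pattern: forward tuned's own module to whatever reduce is set. */
static int tuned_reduce_scatter_buggy(comm_t *comm, module_t *module)
{
    return comm->c_coll.coll_reduce(comm, module);
}

/* Fixed pattern: pass the module registered alongside the reduce. */
static int tuned_reduce_scatter_fixed(comm_t *comm, module_t *module)
{
    (void)module;
    return comm->c_coll.coll_reduce(comm, comm->c_coll.coll_reduce_module);
}

/* Build the mixed setup (reduce from hierarch, reduce_scatter from
 * tuned) and return the status of the selected reduce_scatter variant. */
int run_mixed(int use_fix)
{
    module_t tuned    = { OWNER_TUNED };
    module_t hierarch = { OWNER_HIERARCH };
    comm_t comm;

    comm.c_coll.coll_reduce                = hierarch_reduce;
    comm.c_coll.coll_reduce_module         = &hierarch;
    comm.c_coll.coll_reduce_scatter        = use_fix
                                             ? tuned_reduce_scatter_fixed
                                             : tuned_reduce_scatter_buggy;
    comm.c_coll.coll_reduce_scatter_module = &tuned;

    return comm.c_coll.coll_reduce_scatter(
        &comm, comm.c_coll.coll_reduce_scatter_module);
}
```

With this setup, run_mixed(0) exercises the buggy pattern and hierarch's reduce rejects the tuned module, while run_mixed(1) exercises the fixed pattern and the call succeeds.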