
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-01-13 23:40:22


Let's debate tomorrow when people are around, but first you have to
file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:

> Unfortunately, this shows that we didn't test mixing collective
> modules enough. I went over the tuned collective functions and
> changed all instances to use the correct module information. The fix
> is now on the trunk, revision 20267. At the same time, I checked that
> all other collective components do the right thing ... and I have to
> admit tuned was the only faulty one.
>
> This is clearly a bug in tuned, and correcting it will allow people
> to use hierarch. In its current incarnation, 1.3 will mostly/always
> segfault when hierarch is active. I would prefer not to put a broken
> toy out there. How about pushing r20267 into 1.3?
>
> george.
>
>
> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>
>> Thanks for digging into this. Can you file a bug? Let's mark it
>> for v1.3.1.
>>
>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>> and since hierarch isn't currently selected by default (you must
>> specifically elevate hierarch's priority to get it to run), there's
>> no danger that users will run into this problem in default runs.
>>
>> But clearly the problem needs to be fixed, and therefore we need a
>> bug to track it.
>>
>>
>>
>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>
>>> I just debugged the Reduce_scatter bug mentioned previously. The
>>> bug is unfortunately not in hierarch, but in tuned.
>>>
>>> Here is the code snippet causing the problems:
>>>
>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>> {
>>> ...
>>> err = comm->c_coll.coll_reduce (...., module);
>>> ...
>>> }
>>>
>>>
>>> but should be
>>> {
>>> ...
>>> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>> ...
>>> }
>>>
>>> The problem right now is that when using hierarch, only a subset
>>> of the functions is set, e.g. reduce, allreduce, bcast and
>>> barrier. Thus, reduce_scatter comes from tuned in most scenarios
>>> and calls the subsequent functions with the wrong module. Hierarch
>>> of course does not like that :-)
>>>
>>> Anyway, a quick glance through the tuned code reveals a
>>> significant number of places where this pattern appears
>>> (reduce_scatter, allreduce, allgather, allgatherv). Basic, hierarch
>>> and inter seem to handle this mostly correctly.
>>>
>>> Thanks
>>> Edgar
>>> --
>>> Edgar Gabriel
>>> Assistant Professor
>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>> Department of Computer Science University of Houston
>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>

-- 
Jeff Squyres
Cisco Systems