Let's debate tomorrow when people are around, but first you have to
file a CMR... :-)
On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
> Unfortunately, this pinpoints the fact that we didn't test the mixing
> of collective modules enough. I went over the tuned collective
> functions and changed all instances to use the correct module
> information. It is now on the trunk, revision 20267.
> Simultaneously, I checked that all other collective components do the
> right thing ... and I have to admit tuned was the only faulty one.
> This is clearly a bug in tuned, and correcting it will allow
> people to use hierarch. In its current incarnation, 1.3 will
> mostly/always segfault when hierarch is active. I would prefer not
> to put a broken toy out there. How about pushing r20267 into 1.3?
> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>> Thanks for digging into this. Can you file a bug? Let's mark it
>> for v1.3.1.
>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>> and since hierarch isn't currently selected by default (you must
>> specifically elevate hierarch's priority to get it to run), there's
>> no danger that users will run into this problem in default runs.
>> But clearly the problem needs to be fixed, and therefore we need a
>> bug to track it.
>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>> I just debugged the Reduce_scatter bug mentioned previously. The
>>> bug is unfortunately not in hierarch, but in tuned.
>>> Here is the code snippet causing the problem:
>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>> err = comm->c_coll.coll_reduce (...., module)
>>> but should be
>>> err = comm->c_coll.coll_reduce (..., comm-
>>> The problem right now is that when using hierarch, only
>>> a subset of the functions is set, e.g. reduce, allreduce, bcast and
>>> barrier. Thus, reduce_scatter comes from tuned in most scenarios and
>>> calls the subsequent functions with the wrong module. Hierarch of
>>> course does not like that :-)
>>> Anyway, a quick glance through the tuned code reveals a
>>> significant number of instances where this appears (reduce_scatter,
>>> allreduce, allgather, allgatherv). Basic, hierarch and inter seem
>>> to do that mostly correctly.
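
To make the module mix-up described above concrete, here is a minimal standalone sketch. It is not the Open MPI source: the types `module_t` and `comm_t`, the field `coll_reduce_module`, and the functions `hier_reduce`, `reduce_scatter_buggy` and `reduce_scatter_fixed` are all hypothetical stand-ins, assuming only that each communicator stores a function pointer together with the module that function expects.

```c
#include <string.h>

/* Hypothetical, simplified stand-ins for the real data structures. */
typedef struct module {
    const char *owner;              /* component that created this module */
} module_t;

typedef int (*reduce_fn_t)(int value, module_t *module);

typedef struct comm {
    reduce_fn_t coll_reduce;        /* which component's reduce is selected */
    module_t *coll_reduce_module;   /* the module that function expects */
} comm_t;

/* A hierarch-like reduce that only understands its own module. */
int hier_reduce(int value, module_t *module)
{
    (void)value;
    if (strcmp(module->owner, "hierarch") != 0) {
        /* Real code would interpret foreign module data and likely crash;
           this sketch just reports an error. */
        return -1;
    }
    return 0;
}

/* Buggy pattern: forwards the caller's own (tuned) module. */
int reduce_scatter_buggy(comm_t *comm, module_t *module)
{
    return comm->coll_reduce(42, module);
}

/* Corrected pattern: forwards the module stored next to the function. */
int reduce_scatter_fixed(comm_t *comm, module_t *module)
{
    (void)module;  /* the caller's module is not the one reduce expects */
    return comm->coll_reduce(42, comm->coll_reduce_module);
}
```

With a communicator set up as `{ hier_reduce, &hier }` and a caller module owned by "tuned", the buggy variant hands the tuned module to the hierarch-like reduce and gets the error return, while the fixed variant passes the stored module and succeeds.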
>>> Edgar Gabriel
>>> Assistant Professor
>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>> Department of Computer Science University of Houston
>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>> devel mailing list
>> Jeff Squyres
>> Cisco Systems