Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: Brad Benton (bradford.benton_at_[hidden])
Date: 2009-01-14 15:14:00


r20275 looks good. I suggest that we CMR that into 1.3 and get rc6 rolled
and tested. (actually, Jeff just did the CMR...so off to rc6)
--brad

On Wed, Jan 14, 2009 at 1:16 PM, Edgar Gabriel <gabriel_at_[hidden]> wrote:

> So I am not entirely sure why the bug only happened on trunk; it could in
> theory also appear on v1.3 (is there a difference in how pointer_arrays are
> handled between the two versions?)
>
> Anyway, it passes now on both with changeset 20275. We should probably move
> that over to 1.3 as well, whether for 1.3.0 or 1.3.1 I leave that up to
> others to decide...
>
> Thanks
> Edgar
>
>
> Edgar Gabriel wrote:
>
>> I'm already debugging it. The good news is that it only seems to appear
>> with trunk; with 1.3 (after copying the new tuned module over), all the
>> tests pass.
>>
>> Now if somebody can tell me a trick for telling mpirun not to kill the
>> debugger out from under me, then I could even see where the problem occurs :-)
>>
>> Thanks
>> Edgar
>>
>> George Bosilca wrote:
>>
>>> All these errors are in MPI_Finalize, so it should not be that hard to
>>> find. I'll take a look later this afternoon.
>>>
>>> george.
>>>
>>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>>
>>>> Unfortunately, although this fixed some problems when enabling hierarch
>>>> coll, there is still a segfault in two of IU's tests that only shows up
>>>> when we set
>>>> -mca coll_hierarch_priority 100
>>>>
>>>> See this MTT summary to see how the failures improved on the trunk,
>>>> but that there are still two that segfault even at 1.4a1r20267:
>>>> http://www.open-mpi.org/mtt/index.php?do_redir=923
>>>>
>>>> This link just has the remaining failures:
>>>> http://www.open-mpi.org/mtt/index.php?do_redir=922
>>>>
>>>> So, I'll vote for applying the CMR for 1.3 since it clearly improved
>>>> things, but there is still more to be done to get coll_hierarch ready
>>>> for regular use.
>>>>
>>>> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosilca_at_[hidden]>
>>>> wrote:
>>>>
>>>>> Here we go by the book :)
>>>>>
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/1749
>>>>>
>>>>> george.
>>>>>
>>>>> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>>>>>
>>>>>> Let's debate tomorrow when people are around, but first you have to
>>>>>> file a CMR... :-)
>>>>>>
>>>>>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>>>>>
>>>>>>> Unfortunately, this pinpoints the fact that we didn't sufficiently test
>>>>>>> mixing collective modules. I went over the tuned collective functions
>>>>>>> and changed all instances to use the correct module information. It is
>>>>>>> now on the trunk, revision 20267. At the same time, I checked that all
>>>>>>> other collective components do the right thing ... and I have to admit
>>>>>>> tuned was the only faulty one.
>>>>>>>
>>>>>>> This is clearly a bug in tuned, and correcting it will allow people to
>>>>>>> use hierarch. In its current incarnation, 1.3 will mostly/always
>>>>>>> segfault when hierarch is active. I would prefer not to put a broken
>>>>>>> toy out there. How about pushing r20267 into 1.3?
>>>>>>>
>>>>>>> george.
>>>>>>>
>>>>>>>
>>>>>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>>>>>
>>>>>>>> Thanks for digging into this. Can you file a bug? Let's mark it for
>>>>>>>> v1.3.1.
>>>>>>>>
>>>>>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>>>>>>>> since hierarch isn't currently selected by default (you must
>>>>>>>> specifically elevate hierarch's priority to get it to run), there's no
>>>>>>>> danger that users will run into this problem in default runs.
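
The point about priorities can be illustrated with a small model. This is only a sketch under stated assumptions, not Open MPI's selection code: `component_t`, `select_owner` and the priority values 30 and 0 are hypothetical. It assumes that, for each collective function, the highest-priority component that provides that function wins, and it uses the detail from Edgar's quoted message that hierarch provides only some functions (reduce here) and not reduce_scatter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers for two collective functions. */
enum { FN_REDUCE, FN_REDUCE_SCATTER, FN_COUNT };

/* Hypothetical, simplified component: a priority plus a flag per function. */
typedef struct component {
    const char *name;
    int         priority;
    int         provides[FN_COUNT];  /* 1 if the component implements fn */
} component_t;

/* Return the name of the highest-priority component that provides fn,
 * or NULL if no component provides it. */
static const char *select_owner(const component_t *list, size_t n, int fn)
{
    const component_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!list[i].provides[fn]) {
            continue;
        }
        if (best == NULL || list[i].priority > best->priority) {
            best = &list[i];
        }
    }
    return best ? best->name : NULL;
}

/* Walk through the default case and the elevated-priority case. */
int demo_priority_selection(void)
{
    /* Priorities 30 and 0 are illustrative values, not taken from the thread. */
    component_t list[] = {
        { "tuned",    30, { 1, 1 } },
        { "hierarch",  0, { 1, 0 } },
    };

    /* Default run: tuned supplies reduce, so hierarch is never entered. */
    assert(strcmp(select_owner(list, 2, FN_REDUCE), "tuned") == 0);

    /* Elevated priority, as with -mca coll_hierarch_priority 100. */
    list[1].priority = 100;
    assert(strcmp(select_owner(list, 2, FN_REDUCE), "hierarch") == 0);
    assert(strcmp(select_owner(list, 2, FN_REDUCE_SCATTER), "tuned") == 0);
    return 0;
}
```

In this model a default run never mixes components, which matches the claim that users will not hit the problem unless they raise hierarch's priority. Once they do, reduce and reduce_scatter come from different components on the same communicator.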
>>>>>>>>
>>>>>>>> But clearly the problem needs to be fixed, and therefore we need a bug
>>>>>>>> to track it.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>>>>>
>>>>>>>>> I just debugged the Reduce_scatter bug mentioned previously. The bug is
>>>>>>>>> unfortunately not in hierarch, but in tuned.
>>>>>>>>>
>>>>>>>>> Here is the code snippet causing the problems:
>>>>>>>>>
>>>>>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>>>>>> {
>>>>>>>>> ...
>>>>>>>>> err = comm->c_coll.coll_reduce (...., module);
>>>>>>>>> ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> but should be
>>>>>>>>> {
>>>>>>>>> ...
>>>>>>>>> err = comm->c_coll.coll_reduce (...,
>>>>>>>>> comm->c_coll.coll_reduce_module);
>>>>>>>>> ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The problem right now is that, when using hierarch, only a subset of
>>>>>>>>> the functions is set, e.g. reduce, allreduce, bcast and barrier.
>>>>>>>>> Thus, reduce_scatter comes from tuned in most scenarios, and it calls
>>>>>>>>> the subsequent functions with the wrong module. Hierarch of course
>>>>>>>>> does not like that :-)
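
The mismatch described above can be reproduced in a few lines. The sketch below is a simplified model, not the Open MPI source: `module_t`, `comm_t`, `hierarch_reduce` and the two `reduce_scatter_*` functions are hypothetical stand-ins. It assumes only what the snippet shows, namely that the communicator stores a reduce function pointer together with the module that belongs to it (`coll_reduce` and `coll_reduce_module`).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the real Open MPI structures. */
typedef struct module {
    const char *owner;   /* name of the component that created the module */
} module_t;

typedef int (*reduce_fn_t)(int value, module_t *module);

typedef struct comm {
    reduce_fn_t coll_reduce;         /* reduce chosen for this communicator */
    module_t   *coll_reduce_module;  /* module that goes with coll_reduce */
} comm_t;

/* Stand-in for hierarch's reduce: it only accepts its own module. */
static int hierarch_reduce(int value, module_t *module)
{
    if (strcmp(module->owner, "hierarch") != 0) {
        return -1;  /* the real code would misread foreign module data */
    }
    return value;
}

/* Buggy pattern: forward the module this function was called with. */
static int reduce_scatter_buggy(comm_t *comm, int value, module_t *module)
{
    return comm->coll_reduce(value, module);
}

/* Fixed pattern: use the module stored next to the function pointer. */
static int reduce_scatter_fixed(comm_t *comm, int value, module_t *module)
{
    (void)module;  /* the caller's module is not what reduce expects */
    return comm->coll_reduce(value, comm->coll_reduce_module);
}

/* Runs both variants on a communicator whose reduce comes from hierarch. */
int demo_module_mismatch(void)
{
    module_t tuned = { "tuned" };
    module_t hierarch = { "hierarch" };
    comm_t comm = { hierarch_reduce, &hierarch };

    int buggy = reduce_scatter_buggy(&comm, 7, &tuned);
    int fixed = reduce_scatter_fixed(&comm, 7, &tuned);
    printf("buggy=%d fixed=%d\n", buggy, fixed);

    assert(buggy == -1);  /* hierarch rejects tuned's module */
    assert(fixed == 7);   /* hierarch receives its own module */
    return 0;
}
```

Here the wrong module only produces an error code. In the real library the callee would interpret another component's module data as its own, which is consistent with the segfaults reported in this thread.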
>>>>>>>>>
>>>>>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that
>>>>>>>>> mostly correctly.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Edgar
>>>>>>>>> --
>>>>>>>>> Edgar Gabriel
>>>>>>>>> Assistant Professor
>>>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>>>> Department of Computer Science University of Houston
>>>>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> devel_at_[hidden]
>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Squyres
>>>>>>>> Cisco Systems
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> Cisco Systems
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>>>> tmattox_at_[hidden] || timattox_at_[hidden]
>>>> I'm a bright... http://www.the-brights.net/
>>>
>>
>>
> --
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab http://pstl.cs.uh.edu
> Department of Computer Science University of Houston
> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335