Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: Brad Benton (bradford.benton_at_[hidden])
Date: 2009-01-14 13:15:09


So, if it looks okay on 1.3, then there should not be anything holding up
the release, right? Otherwise, George, we need to decide whether this is a
blocker, or whether we go ahead and release with this as a known issue
and schedule the fix for 1.3.1. My vote is to go ahead and release, but if
you (or others) think otherwise, let's talk about how best to move forward.
--brad

On Wed, Jan 14, 2009 at 12:04 PM, Edgar Gabriel <gabriel_at_[hidden]> wrote:

> I'm already debugging it. The good news is that it only seems to appear
> on the trunk; with 1.3 (after copying the new tuned module over), all the
> tests pass.
>
> Now if somebody can tell me a trick to keep mpirun from killing the
> debugger under my feet, then I could even see where the problem occurs :-)
>
> Thanks
> Edgar
>
>
> George Bosilca wrote:
>
>> All these errors are in MPI_Finalize, so it should not be that hard to
>> find. I'll take a look later this afternoon.
>>
>> george.
>>
>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>
>>> Unfortunately, although this fixed some problems when enabling hierarch coll,
>>> there is still a segfault in two of IU's tests that only shows up when we set
>>> -mca coll_hierarch_priority 100
>>>
>>> See this MTT summary for how the failures improved on the trunk;
>>> two tests still segfault even at 1.4a1r20267:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=923
>>>
>>> This link just has the remaining failures:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=922
>>>
>>> So, I'll vote for applying the CMR for 1.3 since it clearly improved things,
>>> but there is still more to be done to get coll_hierarch ready for regular use.
>>>
>>> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosilca_at_[hidden]>
>>> wrote:
>>>
>>>> Here we go by the book :)
>>>>
>>>> https://svn.open-mpi.org/trac/ompi/ticket/1749
>>>>
>>>> george.
>>>>
>>>> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>>>>
>>>>> Let's debate tomorrow when people are around, but first you have to file a
>>>>> CMR... :-)
>>>>>
>>>>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>>>>
>>>>>> Unfortunately, this pinpoints the fact that we didn't test the
>>>>>> collective module mixing thing enough. I went over the tuned collective
>>>>>> functions and changed all instances to use the correct module information.
>>>>>> It is now on the trunk, revision 20267. At the same time, I checked that
>>>>>> all other collective components do the right thing ... and I have to admit
>>>>>> tuned was the only faulty one.
>>>>>>
>>>>>> This is clearly a bug in tuned, and correcting it will allow people to use
>>>>>> hierarch. In the current incarnation, 1.3 will mostly/always segfault when
>>>>>> hierarch is active. I would prefer not to give a broken toy out there. How
>>>>>> about pushing r20267 into 1.3?
>>>>>>
>>>>>> george.
>>>>>>
>>>>>>
>>>>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>>>>
>>>>>>> Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1.
>>>>>>>
>>>>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>>>>>>> since hierarch isn't currently selected by default (you must specifically
>>>>>>> elevate hierarch's priority to get it to run), there's no danger that users
>>>>>>> will run into this problem in default runs.
>>>>>>>
>>>>>>> But clearly the problem needs to be fixed, and therefore we need a bug
>>>>>>> to track it.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>>>>
>>>>>>>> I just debugged the Reduce_scatter bug mentioned previously. The bug is
>>>>>>>> unfortunately not in hierarch, but in tuned.
>>>>>>>>
>>>>>>>> Here is the code snippet causing the problems:
>>>>>>>>
>>>>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>>>>> {
>>>>>>>>     ...
>>>>>>>>     /* wrong: hands tuned's own module to whichever reduce was selected */
>>>>>>>>     err = comm->c_coll.coll_reduce (...., module);
>>>>>>>>     ...
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> but should be
>>>>>>>> {
>>>>>>>>     ...
>>>>>>>>     /* right: hands over the module that belongs to the selected reduce */
>>>>>>>>     err = comm->c_coll.coll_reduce (...,
>>>>>>>>                                     comm->c_coll.coll_reduce_module);
>>>>>>>>     ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> The problem as it is right now is that when using hierarch, only a
>>>>>>>> subset of the functions is set, e.g. reduce, allreduce, bcast and barrier.
>>>>>>>> Thus, reduce_scatter comes from tuned in most scenarios, and calls the
>>>>>>>> subsequent functions with the wrong module. Hierarch of course does not
>>>>>>>> like that :-)
>>>>>>>>
>>>>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that mostly
>>>>>>>> correctly.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Edgar
>>>>>>>> --
>>>>>>>> Edgar Gabriel
>>>>>>>> Assistant Professor
>>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>>> Department of Computer Science University of Houston
>>>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> Cisco Systems
>>>>>>>
>>> --
>>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>>> tmattox_at_[hidden] || timattox_at_[hidden]
>>> I'm a bright... http://www.the-brights.net/