Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-01-14 13:01:36


All these errors are in MPI_Finalize, so they should not be that hard
to find. I'll take a look later this afternoon.

   george.

On Jan 14, 2009, at 06:41 , Tim Mattox wrote:

> Unfortunately, although this fixed some problems when enabling hierarch coll,
> there is still a segfault in two of IU's tests that only shows up when we set
> -mca coll_hierarch_priority 100
>
> See this MTT summary to see how the failures improved on the trunk,
> but that there are still two that segfault even at 1.4a1r20267:
> http://www.open-mpi.org/mtt/index.php?do_redir=923
>
> This link just has the remaining failures:
> http://www.open-mpi.org/mtt/index.php?do_redir=922
>
> So, I'll vote for applying the CMR for 1.3 since it clearly improved things,
> but there is still more to be done to get coll_hierarch ready for regular
> use.
>
> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca
> <bosilca_at_[hidden]> wrote:
>> Here we go by the book :)
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/1749
>>
>> george.
>>
>> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>>
>>> Let's debate tomorrow when people are around, but first you have to file a
>>> CMR... :-)
>>>
>>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>>
>>>> Unfortunately, this pinpoints the fact that we didn't test the
>>>> collective module mixing enough. I went over the tuned collective
>>>> functions and changed all instances to use the correct module
>>>> information. It is now on the trunk, revision 20267. At the same time,
>>>> I checked that all other collective components do the right thing ...
>>>> and I have to admit tuned was the only faulty one.
>>>>
>>>> This is clearly a bug in tuned, and correcting it will allow people
>>>> to use hierarch. In its current incarnation, 1.3 will mostly/always
>>>> segfault when hierarch is active. I would prefer not to put a broken
>>>> toy out there. How about pushing r20267 into 1.3?
>>>>
>>>> george.
>>>>
>>>>
>>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>>
>>>>> Thanks for digging into this. Can you file a bug? Let's mark it for
>>>>> v1.3.1.
>>>>>
>>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>>>>> since hierarch isn't currently selected by default (you must specifically
>>>>> elevate hierarch's priority to get it to run), there's no danger that
>>>>> users will run into this problem in default runs.
>>>>>
>>>>> But clearly the problem needs to be fixed, and therefore we need a bug
>>>>> to track it.
>>>>>
>>>>>
>>>>>
>>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>>
>>>>>> I just debugged the Reduce_scatter bug mentioned previously. The bug
>>>>>> is unfortunately not in hierarch, but in tuned.
>>>>>>
>>>>>> Here is the code snippet causing the problems:
>>>>>>
>>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>>> {
>>>>>> ...
>>>>>> err = comm->c_coll.coll_reduce (...., module);
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>>
>>>>>> but should be
>>>>>> {
>>>>>> ...
>>>>>> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> The problem as it stands is that, when using hierarch, only a subset
>>>>>> of the functions is set, e.g. reduce, allreduce, bcast and barrier.
>>>>>> Thus, reduce_scatter comes from tuned in most scenarios, and calls the
>>>>>> subsequent functions with the wrong module. Hierarch of course does
>>>>>> not like that :-)
>>>>>>
>>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that
>>>>>> mostly correctly.
>>>>>>
>>>>>> Thanks
>>>>>> Edgar
>>>>>> --
>>>>>> Edgar Gabriel
>>>>>> Assistant Professor
>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>> Department of Computer Science University of Houston
>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>
>>
>
>
>
> --
> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
> tmattox_at_[hidden] || timattox_at_[hidden]
> I'm a bright... http://www.the-brights.net/
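
To make the module mismatch Edgar describes in the quoted message easier to see, here is a minimal, self-contained sketch in C. It is not the Open MPI source: the types module_t and comm_t, the "owner" tag, and the function names hierarch_reduce, reduce_scatter_buggy, reduce_scatter_fixed and demo_mixed_modules are all hypothetical stand-ins. Only the field names c_coll, coll_reduce and coll_reduce_module follow the snippet in the thread. Where the real code segfaults, this model just returns -1 when a function receives a module that another component created.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a collective module: it only records
 * which component created it. */
typedef struct module {
    const char *owner;
} module_t;

typedef int (*reduce_fn_t)(module_t *module);

/* Hypothetical stand-in for a communicator: each collective function
 * pointer is stored next to the module that belongs to it. */
typedef struct comm {
    struct {
        reduce_fn_t coll_reduce;
        module_t   *coll_reduce_module;
    } c_coll;
} comm_t;

/* A reduce that only works with its own component's module. The real
 * code would dereference component-private data here and crash; this
 * model reports the mismatch with -1 instead. */
static int hierarch_reduce(module_t *module)
{
    if (module == NULL || strcmp(module->owner, "hierarch") != 0) {
        return -1;
    }
    return 0;
}

/* Buggy pattern: forwards the caller's own (tuned) module to whatever
 * reduce was selected for the communicator. */
static int reduce_scatter_buggy(comm_t *comm, module_t *module)
{
    assert(comm != NULL);
    return comm->c_coll.coll_reduce(module);
}

/* Fixed pattern: passes the module stored alongside the selected
 * reduce function. */
static int reduce_scatter_fixed(comm_t *comm, module_t *module)
{
    assert(comm != NULL);
    (void)module; /* still needed for tuned's own state, unused here */
    return comm->c_coll.coll_reduce(comm->c_coll.coll_reduce_module);
}

/* Models the mixed setup from the thread: reduce comes from hierarch,
 * reduce_scatter comes from tuned. */
int demo_mixed_modules(int use_fix)
{
    module_t tuned    = { "tuned" };
    module_t hierarch = { "hierarch" };
    comm_t   comm     = { { hierarch_reduce, &hierarch } };

    return use_fix ? reduce_scatter_fixed(&comm, &tuned)
                   : reduce_scatter_buggy(&comm, &tuned);
}
```

In this model, demo_mixed_modules(0) reports the mismatch and demo_mixed_modules(1) succeeds, because the fixed variant looks the module up at the same place it looks up the function pointer.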