Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] reduce_scatter bug with hierarch
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-01-14 13:11:35


In case a parallel debugger is not available, I use mpirun
*blahblah* xterm -e gdb [app_name], and this works pretty well as long
as ssh forwards the X11 display.

   Hope this helps,
     george.

On Jan 14, 2009, at 13:04 , Edgar Gabriel wrote:

> I'm already debugging it. The good news is that it only seems to
> appear with trunk; with 1.3 (after copying the new tuned module
> over), all the tests pass.
>
> Now if somebody can tell me a trick for telling mpirun not to kill
> the debugger under my feet, then I could even see where the problem
> occurs :-)
>
> Thanks
> Edgar
>
> George Bosilca wrote:
>> All these errors are in the MPI_Finalize, it should not be that
>> hard to find. I'll take a look later this afternoon.
>> george.
>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>> Unfortunately, although this fixed some problems when enabling
>>> hierarch coll, there is still a segfault in two of IU's tests that
>>> only shows up when we set
>>> -mca coll_hierarch_priority 100
>>>
>>> See this MTT summary to see how the failures improved on the trunk,
>>> but that there are still two that segfault even at 1.4a1r20267:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=923
>>>
>>> This link just has the remaining failures:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=922
>>>
>>> So, I'll vote for applying the CMR for 1.3 since it clearly
>>> improved things, but there is still more to be done to get
>>> coll_hierarch ready for regular use.
>>>
>>> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosilca_at_[hidden]> wrote:
>>>> Here we go by the book :)
>>>>
>>>> https://svn.open-mpi.org/trac/ompi/ticket/1749
>>>>
>>>> george.
>>>>
>>>> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>>>>
>>>>> Let's debate tomorrow when people are around, but first you have
>>>>> to file a CMR... :-)
>>>>>
>>>>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>>>>
>>>>>> Unfortunately, this pinpoints the fact that we didn't test the
>>>>>> mixing of collective modules enough. I went over the tuned
>>>>>> collective functions and changed all instances to use the correct
>>>>>> module information. It is now on the trunk, revision 20267. At the
>>>>>> same time, I checked that all other collective components do the
>>>>>> right thing ... and I have to admit tuned was the only faulty one.
>>>>>>
>>>>>> This is clearly a bug in tuned, and correcting it will allow
>>>>>> people to use hierarch. In its current incarnation, 1.3 will
>>>>>> mostly/always segfault when hierarch is active. I would prefer not
>>>>>> to put a broken toy out there. How about pushing r20267 into 1.3?
>>>>>>
>>>>>> george.
>>>>>>
>>>>>>
>>>>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>>>>
>>>>>>> Thanks for digging into this. Can you file a bug? Let's mark it
>>>>>>> for v1.3.1.
>>>>>>>
>>>>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects
>>>>>>> hierarch, and since hierarch isn't currently selected by default
>>>>>>> (you must specifically elevate hierarch's priority to get it to
>>>>>>> run), there's no danger that users will run into this problem in
>>>>>>> default runs.
>>>>>>>
>>>>>>> But clearly the problem needs to be fixed, and therefore we need
>>>>>>> a bug to track it.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>>>>
>>>>>>>> I just debugged the Reduce_scatter bug mentioned previously.
>>>>>>>> The bug is unfortunately not in hierarch, but in tuned.
>>>>>>>>
>>>>>>>> Here is the code snippet causing the problems:
>>>>>>>>
>>>>>>>> int reduce_scatter (...., mca_coll_base_module_t *module)
>>>>>>>> {
>>>>>>>>     ...
>>>>>>>>     err = comm->c_coll.coll_reduce (...., module);
>>>>>>>>     ...
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> but should be
>>>>>>>> {
>>>>>>>>     ...
>>>>>>>>     err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>>>>>>>     ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> The problem right now is that when using hierarch, only a subset
>>>>>>>> of the functions is set, e.g. reduce, allreduce, bcast and
>>>>>>>> barrier. Thus, reduce_scatter comes from tuned in most scenarios
>>>>>>>> and calls the subsequent functions with the wrong module.
>>>>>>>> Hierarch of course does not like that :-)
>>>>>>>>
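
The module mismatch described above can be illustrated with a minimal,
self-contained sketch in C. Everything in it is a hypothetical stand-in
invented for illustration: module_t, comm_t, the tag values, and all
function names are simplified assumptions, not Open MPI's real
structures. In particular, the mock hierarch_reduce returns -1 when it
is handed a foreign module, whereas the real code would most likely
dereference state it does not own and segfault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags identifying which component owns a module. */
enum { HIERARCH_TAG = 1, TUNED_TAG = 2 };

/* Simplified stand-in for a per-component collective module. */
typedef struct module {
    int tag;  /* marks the component whose private state this is */
} module_t;

typedef int (*reduce_fn_t)(int value, module_t *module);

/* Simplified stand-in for a communicator's collective table:
 * each function pointer is stored next to the module it belongs to. */
typedef struct comm {
    reduce_fn_t coll_reduce;
    module_t   *coll_reduce_module;
} comm_t;

/* Mock hierarch reduce: only works with hierarch's own module. */
static int hierarch_reduce(int value, module_t *module)
{
    assert(module != NULL);
    if (module->tag != HIERARCH_TAG) {
        return -1;  /* stands in for the crash on a foreign module */
    }
    return value;
}

/* Buggy pattern: tuned forwards its own module to whatever reduce
 * implementation was selected for the communicator. */
static int tuned_reduce_scatter_buggy(comm_t *comm, int value,
                                      module_t *module)
{
    return comm->coll_reduce(value, module);
}

/* Corrected pattern: pass the module stored alongside coll_reduce. */
static int tuned_reduce_scatter_fixed(comm_t *comm, int value,
                                      module_t *module)
{
    (void)module;  /* tuned's own module is not needed for this call */
    return comm->coll_reduce(value, comm->coll_reduce_module);
}

/* Sets up a communicator whose reduce comes from hierarch while
 * reduce_scatter comes from tuned, then runs one of the two variants. */
int run_reduce_scatter(int use_fix)
{
    module_t hierarch = { HIERARCH_TAG };
    module_t tuned    = { TUNED_TAG };
    comm_t   comm     = { hierarch_reduce, &hierarch };

    return use_fix ? tuned_reduce_scatter_fixed(&comm, 7, &tuned)
                   : tuned_reduce_scatter_buggy(&comm, 7, &tuned);
}
```

Calling run_reduce_scatter(0) exercises the buggy path and yields the
mock error value, while run_reduce_scatter(1) passes the module that
matches the selected reduce and yields the input value back.
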
>>>>>>>> Anyway, a quick glance through the tuned code reveals a
>>>>>>>> significant number of instances where this appears
>>>>>>>> (reduce_scatter, allreduce, allgather, allgatherv). Basic,
>>>>>>>> hierarch and inter seem to do that mostly correctly.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Edgar
>>>>>>>> --
>>>>>>>> Edgar Gabriel
>>>>>>>> Assistant Professor
>>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>>> Department of Computer Science University of Houston
>>>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> devel_at_[hidden]
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> Cisco Systems
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
>>> tmattox_at_[hidden] || timattox_at_[hidden]
>>> I'm a bright... http://www.the-brights.net/
>