Open MPI Development Mailing List Archives

From: Andrew Friedley (afriedle_at_[hidden])
Date: 2007-09-21 12:19:31


Thanks George. I figured out the problem (two of them actually) based
on a pointer from Gleb (thanks Gleb). I have two types of send queues
on the UD BTL -- one is per-module, and the other is per-endpoint. I
had missed looking for stuck frags on the per-endpoint queues.

So something is wrong with the interaction between the per-endpoint queues
and the per-module queue. Disabling the per-endpoint queues makes the
problem go away, and I'm not sure they were worth having in the first place.
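
The rough shape of the two levels is something like this (approximate
names only, not the actual ofud source; opal_list_t comes from
opal/class/opal_list.h):

    struct ud_module {
        opal_list_t pending_frags;   /* per-module queue, drained by progress */
        int32_t     sd_wqe;          /* free send WQE slots (btl_ofud_sd_num)  */
    };

    struct ud_endpoint {
        opal_list_t pending_frags;   /* per-endpoint queue -- where the stuck
                                        frags were hiding */
    };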

But this still left a similar problem at 2KB messages. I had static
limits set for free list lengths based on the btl_ofud_sd_num MCA
parameter. Switching the max to unlimited makes this problem go away
too. Good enough to get some runs through for now :)
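
For context, the old per-list cap behaved conceptually like this (generic
sketch, not the real Open MPI free list code):

    #include <stdlib.h>

    typedef struct {
        int    num_allocated;   /* frags handed out so far                            */
        int    max;             /* was derived from btl_ofud_sd_num; <= 0 = unlimited */
        size_t frag_size;
    } frag_list_t;

    void *frag_alloc(frag_list_t *fl)
    {
        if (fl->max > 0 && fl->num_allocated >= fl->max)
            return NULL;               /* cap hit: this is where the 2KB sends stalled */
        fl->num_allocated++;
        return malloc(fl->frag_size);
    }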

Andrew

George Bosilca wrote:
> Andrew,
>
> There is an option in the message queue support that allows you to see
> all internal pending requests. On the current trunk, edit the file
> ompi/debuggers/ompi_dll.c at line 736 and set
> p_info->show_internal_requests to 1. Recompile and install it, then
> restart TotalView. You should be able to get access to all pending
> requests, even those created by the collective modules.
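>
> The change is just that one line, roughly (exact context from memory):
>
>     /* ompi/debuggers/ompi_dll.c, around line 736 */
>     p_info->show_internal_requests = 1;   /* default is 0 */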
>
> Moreover, the missing sends should be somewhere. If they are not in the
> BTL, and if they are not completed, then hopefully they are in the PML's
> send_pending list. Since the collective works over all the other BTLs, I
> suppose the communication pattern is correct, so something is going wrong
> with the requests when using the UD BTL.
>
> If the requests are not in the PML send_pending queue, the next thing
> you can do is modify the receive handlers in the OB1 PML and print
> every incoming match header. You will have to sort the output somehow,
> but at least you can figure out what is happening to the missing messages.
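>
> Something like this in the fragment receive path would do it (field names
> from memory -- check them against the match header definition in
> pml_ob1_hdr.h):
>
>     /* Sketch: call this on every incoming fragment's match header. */
>     static void dump_match_hdr(const mca_pml_ob1_match_hdr_t *hdr)
>     {
>         opal_output(0, "match hdr: ctx=%d src=%d tag=%d seq=%d",
>                     (int) hdr->hdr_ctx, (int) hdr->hdr_src,
>                     (int) hdr->hdr_tag, (int) hdr->hdr_seq);
>     }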
>
> george.
>
> On Sep 11, 2007, at 12:37 PM, Andrew Friedley wrote:
>
>> First off, I've managed to reproduce this with nbcbench using only 16
>> procs (two per node), and setting btl_ofud_sd_num to 12 -- eases
>> debugging with fewer procs to look at.
>>
>> ompi_coll_tuned_alltoall_intra_basic_linear is the alltoall routine that
>> is being called. What I'm seeing from TotalView is that some random
>> number of procs (1-5 usually, varies from run to run) are sitting with a
>> send and a recv outstanding to every other proc. The other procs
>> however have moved on to the next collective. This is hard to see with
>> the default nbcbench code since it calls only alltoall repeatedly --
>> adding a barrier after the MPI_Alltoall() call makes it easier to see,
>> as the barrier has a different tag number and communication pattern. So
>> what I see is a few procs stuck in alltoall, while the rest are waiting
>> in the following barrier.
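>>
>> For reference, the modified timing loop is essentially this (variable
>> names made up):
>>
>>     /* Benchmark inner loop with the extra barrier. */
>>     for (i = 0; i < iters; i++) {
>>         MPI_Alltoall(sbuf, count, MPI_BYTE, rbuf, count, MPI_BYTE, comm);
>>         MPI_Barrier(comm);   /* different tag and pattern, so stuck ranks
>>                                 stand out clearly in TotalView */
>>     }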
>>
>> I've also verified with TotalView that there are no outstanding send
>> WQEs at the UD BTL, and all procs are polling progress. The procs in
>> the alltoall are polling in the opal_condition_wait() called from
>> ompi_request_wait_all().
>>
>> Not sure what to ask or where to look further other than, what should I
>> look at to see what requests are outstanding in the PML?
>>
>> Andrew
>>
>> George Bosilca wrote:
>>> The first step will be to figure out which version of the alltoall
>>> you're using. I suppose you are using the default parameters, in which
>>> case the decision function in the tuned component will pick the linear
>>> alltoall. As the name suggests, this means that every node posts one
>>> receive from every other node and then starts sending the respective
>>> fragment to every other node. This leads to a lot of outstanding sends
>>> and receives. I doubt the receives can cause a problem, so I expect the
>>> problem is coming from the send side.
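>>>
>>> In rough MPI terms the pattern is the following (the real tuned code goes
>>> through the PML directly, but conceptually it does the same thing):
>>>
>>>     /* Linear alltoall: one irecv and one isend per peer, then wait. */
>>>     for (i = 0; i < nprocs; i++)
>>>         MPI_Irecv(rbuf + i * sz, sz, MPI_BYTE, i, tag, comm, &reqs[i]);
>>>     for (i = 0; i < nprocs; i++)
>>>         MPI_Isend(sbuf + i * sz, sz, MPI_BYTE, i, tag, comm, &reqs[nprocs + i]);
>>>     MPI_Waitall(2 * nprocs, reqs, MPI_STATUSES_IGNORE);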
>>>
>>> Do you have TotalView installed on odin? If so, there is a simple way
>>> to see how many sends are pending and where ... That might pinpoint
>>> [at least] the process where you should look to see what's wrong.
>>>
>>> george.
>>>
>>> On Aug 29, 2007, at 12:37 AM, Andrew Friedley wrote:
>>>
>>>> I'm having a problem with the UD BTL and hoping someone might have
>>>> some
>>>> input to help solve it.
>>>>
>>>> What I'm seeing is hangs when running alltoall benchmarks with
>>>> nbcbench
>>>> or an LLNL program called mpiBench -- both hang exactly the same way.
>>>> With the code on the trunk running nbcbench on IU's odin using 32
>>>> nodes
>>>> and a command line like this:
>>>>
>>>> mpirun -np 128 -mca btl ofud,self ./nbcbench -t MPI_Alltoall -p
>>>> 128-128
>>>> -s 1-262144
>>>>
>>>> hangs consistently when testing 256-byte messages. There are two
>>>> things
>>>> I can do to make the hang go away until running at larger scale.
>>>> First
>>>> is to increase the 'btl_ofud_sd_num' MCA param from its default
>>>> value of
>>>> 128. This allows you to run with more procs/nodes before hitting the
>>>> hang, but AFAICT doesn't fix the actual problem. What this parameter
>>>> does is control the maximum number of outstanding send WQEs posted at
>>>> the IB level -- when the limit is reached, frags are queued on an
>>>> opal_list_t and later sent by progress as IB sends complete.
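>>>>
>>>> Conceptually the send path looks like this (sketch only, not the exact
>>>> ofud code):
>>>>
>>>>     if (module->sd_wqe > 0) {                /* a send WQE slot is free */
>>>>         module->sd_wqe--;
>>>>         ibv_post_send(qp, &frag->wr_desc, &bad_wr);
>>>>     } else {
>>>>         /* out of slots: park the frag; progress re-posts it as send
>>>>            completions come back and free up WQEs */
>>>>         opal_list_append(&module->pending_frags, (opal_list_item_t *) frag);
>>>>     }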
>>>>
>>>> The other way I've found is to play games with calling
>>>> mca_btl_ud_component_progress() in mca_btl_ud_endpoint_post_send(). In
>>>> fact I replaced the CHECK_FRAG_QUEUES() macro used around
>>>> btl_ofud_endpoint.c:77 with a version that loops on progress until a
>>>> send WQE slot is available (as opposed to queueing). Same result -- I
>>>> can run at larger scale, but still hit the hang eventually.
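>>>>
>>>> i.e. the replacement spins roughly like this (sketch, same made-up names
>>>> as above):
>>>>
>>>>     /* Instead of queueing the frag, spin on progress until a WQE frees up. */
>>>>     while (module->sd_wqe <= 0) {
>>>>         mca_btl_ud_component_progress();
>>>>     }
>>>>     module->sd_wqe--;
>>>>     ibv_post_send(qp, &frag->wr_desc, &bad_wr);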
>>>>
>>>> It appears that when the job hangs, progress is being polled very
>>>> quickly, and after spinning for a while there are no outstanding send
>>>> WQEs or queued sends in the BTL. I'm not sure where further up things
>>>> are spinning/blocking, as I can't produce the hang at less than 32
>>>> nodes
>>>> / 128 procs and don't have a good way of debugging that (suggestions
>>>> appreciated).
>>>>
>>>> Furthermore, both ob1 and dr PMLs result in the same behavior, except
>>>> that DR eventually trips a watchdog timeout, fails the BTL, and
>>>> terminates the job.
>>>>
>>>> Other collectives such as allreduce and allgather do not hang -- only
>>>> alltoall. I can also reproduce the hang on LLNL's Atlas machine.
>>>>
>>>> Can anyone else reproduce this (Torsten might have to make a copy of
>>>> nbcbench available)? Anyone have any ideas as to what's wrong?
>>>>
>>>> Andrew