Thanks George. I figured out the problem (two of them actually) based
on a pointer from Gleb (thanks Gleb). I have two types of send queues
on the UD BTL -- one is per-module, and the other is per-endpoint. I
had missed looking for stuck frags on the per-endpoint queues.
So something is wrong with the per-endpoint queues and their interaction
with the per-module queue. Disabling the per-endpoint queue makes the
problem go away, and I'm not sure I liked having them in the first place.
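
To illustrate why checking only one queue level hides stuck frags, here is a minimal sketch. The demo_* types and the function are hypothetical stand-ins for illustration only, not the actual ofud BTL structures; the only thing taken from the description above is that pending frags can sit on a per-module queue or on a per-endpoint queue.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real ofud BTL types. */
typedef struct {
    size_t pending_frags;        /* frags waiting on this endpoint's queue */
} demo_endpoint_t;

typedef struct {
    size_t pending_frags;        /* frags waiting on the module-wide queue */
    demo_endpoint_t *endpoints;  /* all endpoints owned by this module */
    size_t num_endpoints;
} demo_module_t;

/* Count every queued frag at both levels. Looking only at
 * module->pending_frags would miss frags stuck on an endpoint queue. */
static size_t demo_count_stuck_frags(const demo_module_t *module)
{
    size_t total = module->pending_frags;
    for (size_t i = 0; i < module->num_endpoints; i++) {
        total += module->endpoints[i].pending_frags;
    }
    return total;
}
```

In this sketch a module whose own queue is empty can still report a non-zero total, which is the situation that was easy to overlook in the debugger.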
But this still left a similar problem at 2kb messages. I had static
limits set for free list lengths based on the btl_ofud_sd_num MCA
parameter. Switching the max to unlimited makes this problem go away
too. Good enough to get some runs through for now :)
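
For readers following along, the send-slot limit and queueing behavior discussed in this thread can be sketched roughly as below. This is a simplified model under stated assumptions, not the BTL code: the sketch_* names, the fixed-size ring, and the "one completion per progress call" rule are all invented for illustration. Only the idea comes from the thread: a cap on outstanding send WQEs (cf. btl_ofud_sd_num), with frags queued when the cap is hit and sent later by progress as sends complete.

```c
#include <stddef.h>

/* Hypothetical model; names and sizes are assumptions for illustration. */
#define SKETCH_MAX_PENDING 64

typedef struct {
    int sd_num;                       /* cap on outstanding send WQEs */
    int outstanding;                  /* sends currently posted */
    int pending[SKETCH_MAX_PENDING];  /* FIFO ring of queued frag ids */
    size_t head;                      /* index of the oldest queued frag */
    size_t count;                     /* number of queued frags */
    int posted_total;                 /* frags posted so far */
} sketch_module_t;

static void sketch_init(sketch_module_t *m, int sd_num)
{
    m->sd_num = sd_num;
    m->outstanding = 0;
    m->head = 0;
    m->count = 0;
    m->posted_total = 0;
}

/* Returns 1 if the frag was posted, 0 if it was queued, and -1 if the
 * statically sized pending ring is full (a fixed limit can refuse work). */
static int sketch_post_send(sketch_module_t *m, int frag_id)
{
    /* Post directly only if nothing is queued ahead, to keep FIFO order. */
    if (m->count == 0 && m->outstanding < m->sd_num) {
        m->outstanding++;
        m->posted_total++;
        return 1;
    }
    if (m->count == SKETCH_MAX_PENDING) {
        return -1;
    }
    m->pending[(m->head + m->count) % SKETCH_MAX_PENDING] = frag_id;
    m->count++;
    return 0;
}

/* Simulate one send completion, then drain queued frags into free slots. */
static void sketch_progress(sketch_module_t *m)
{
    if (m->outstanding > 0) {
        m->outstanding--;
    }
    while (m->count > 0 && m->outstanding < m->sd_num) {
        m->head = (m->head + 1) % SKETCH_MAX_PENDING;
        m->count--;
        m->outstanding++;
        m->posted_total++;
    }
}
```

With sd_num set to 2, a third post is queued rather than posted, and the next progress call moves it into the slot freed by a completion. A frag that never gets drained from such a queue would produce the kind of hang described in this thread.
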
George Bosilca wrote:
> There is an option in the message queue support that allows you to see all
> internal pending requests. On the current trunk, edit the file
> ompi/debuggers/ompi_dll.s at line 736 and set the
> p_info->show_internal_requests to 1. Now compile and install it, and
> then restart totalview. You should be able to get access to all pending
> requests, even those created by the collective modules.
> Moreover, the missing sends should be somewhere. If they are not in the
> BTL, and if they are not completed, then hopefully they are in the PML in
> the send_pending list. As the collective works on all other BTL I
> suppose the communication pattern is correct, so there is something
> happening with the requests when using the UD BTL.
> If the requests are not in the PML send_pending queue, the next thing
> you can do is to modify the receive handlers in the OB1 PML, and print
> all incoming match headers. You will have to somehow sort the output, but
> at least you can figure out what is happening with the missing messages.
> On Sep 11, 2007, at 12:37 PM, Andrew Friedley wrote:
>> First off, I've managed to reproduce this with nbcbench using only 16
>> procs (two per node), and setting btl_ofud_sd_num to 12 -- eases
>> debugging with fewer procs to look at.
>> ompi_coll_tuned_alltoall_intra_basic_linear is the alltoall routine that
>> is being called. What I'm seeing from totalview is that some random
>> number of procs (1-5 usually, varies from run to run) are sitting with a
>> send and a recv outstanding to every other proc. The other procs
>> however have moved on to the next collective. This is hard to see with
>> the default nbcbench code since it calls only alltoall repeatedly --
>> adding a barrier after the MPI_Alltoall() call makes it easier to see,
>> as the barrier has a different tag number and communication pattern. So
>> what I see is a few procs stuck in alltoall, while the rest are waiting
>> in the following barrier.
>> I've also verified with totalview that there are no outstanding send
>> wqe's at the UD BTL, and all procs are polling progress. The procs in
>> the alltoall are polling in the opal_condition_wait() called from
>> Not sure what to ask or where to look further other than, what should I
>> look at to see what requests are outstanding in the PML?
>> George Bosilca wrote:
>>> The first step will be to figure out which version of the alltoall
>>> you're using. I suppose you use the default parameters, and then the
>>> decision function in the tuned component says it is using the linear
>>> all to all. As the name states, this means that every node will
>>> post one receive from each other node and then will start sending to
>>> every other node the respective fragment. This will lead to a lot of
>>> outstanding sends and receives. I doubt that the receive can cause a
>>> problem, so I expect the problem is coming from the send side.
>>> Do you have TotalView installed on odin? If yes there is a
>>> simple way to see how many sends are pending and where ... That might
>>> pinpoint [at least] the process where you should look to see what'
>>> On Aug 29, 2007, at 12:37 AM, Andrew Friedley wrote:
>>>> I'm having a problem with the UD BTL and hoping someone might have
>>>> input to help solve it.
>>>> What I'm seeing is hangs when running alltoall benchmarks with
>>>> or an LLNL program called mpiBench -- both hang exactly the same way.
>>>> With the code on the trunk running nbcbench on IU's odin using 32
>>>> and a command line like this:
>>>> mpirun -np 128 -mca btl ofud,self ./nbcbench -t MPI_Alltoall -p
>>>> -s 1-262144
>>>> hangs consistently when testing 256-byte messages. There are two things
>>>> I can do to make the hang go away until running at larger scale. One
>>>> is to increase the 'btl_ofud_sd_num' MCA param from its default value of
>>>> 128. This allows you to run with more procs/nodes before hitting the
>>>> hang, but AFAICT doesn't fix the actual problem. What this parameter
>>>> does is control the maximum number of outstanding send WQEs posted at
>>>> the IB level -- when the limit is reached, frags are queued on an
>>>> opal_list_t and later sent by progress as IB sends complete.
>>>> The other way I've found is to play games with calling
>>>> mca_btl_ud_component_progress() in mca_btl_ud_endpoint_post_send(). In
>>>> fact I replaced the CHECK_FRAG_QUEUES() macro used around
>>>> btl_ofud_endpoint.c:77 with a version that loops on progress until a
>>>> send WQE slot is available (as opposed to queueing). Same result -- I
>>>> can run at larger scale, but still hit the hang eventually.
>>>> It appears that when the job hangs, progress is being polled very
>>>> quickly, and after spinning for a while there are no outstanding send
>>>> WQEs or queued sends in the BTL. I'm not sure where further up things
>>>> are spinning/blocking, as I can't produce the hang at less than 32
>>>> / 128 procs and don't have a good way of debugging that (suggestions
>>>> Furthermore, both ob1 and dr PMLs result in the same behavior, except
>>>> that DR eventually trips a watchdog timeout, fails the BTL, and
>>>> terminates the job.
>>>> Other collectives such as allreduce and allgather do not hang -- only
>>>> alltoall. I can also reproduce the hang on LLNL's Atlas machine.
>>>> Can anyone else reproduce this (Torsten might have to make a copy of
>>>> nbcbench available)? Anyone have any ideas as to what's wrong?
>>>> devel mailing list