I guess too much optimization always bites back :) In a few words, here is a description of the problem. The PML is event-based: each action is triggered either by a function call from the upper level or by a callback from the lower one. The last set of optimizations on the PML/BTL removed this callback in some cases, leaving the PML in a state where it is unable to make any progress. In this particular test (the problem is not necessarily related to SM; it is just that we didn't find the right number of pending sends to trigger it over the other BTLs), the test executes a set of isends followed by a blocking send. The isends go over SM, and since we do not progress inside isend, we fill up the SM queue. When the blocking send gets posted, it is delayed (as there is no more room in the SM file) and is added by the PML to the pending-send queue. So far, so good. Except that at this point we return from the PML function and go into the condition. The condition calls the BML progress functions, but since no callbacks reach the PML, the PML is unable to reschedule the send.
This didn't happen until recently, but that was pure luck. Before, there was a pending queue in the SM BTL, and the message eventually got sent at some point without involving the PML. Anyway, as I said, the problem could happen with any other BTL if we post the right number of non-blocking sends.
Here is the solution I propose. If you see any problem with it, please let me know ASAP.
Move the progress function from the BML layer back into the PML. The PML will then have a way to check on its pending requests and progress them accordingly. This solution has the same number of function calls as what we have today, and should affect performance only minimally (one more "if" in the progress function).
On Jun 25, 2008, at 4:06 AM, Lenny Verkhovsky wrote:
> I downloaded a new version from trunk and got the following:
> 1. opal_output for no reason (probably something was forgotten)
> 2. it got stuck.
> /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 2 -hostfile
> hostfile_w4_8 ./osu_bw
> [witch4:20920] Using eager rdma: 1
> [witch4:20921] Using eager rdma: 1
> # OSU MPI Bandwidth Test (Version 2.1)
> # Size Bandwidth (MB/s)
> (got stuck)