I got some free time (yee-haw) and took a look at the OB1 PML in order
to fix the issue. I think I found the problem, as I'm no longer able to
reproduce this error. Can you please give it a try with r20946 and
r20947 but without r20944?
On Apr 6, 2009, at 14:49 , Eugene Loh wrote:
> This strikes me as very reasonable. That is, the PML should be
> fixed, but to keep the issue from being a 1.3.2 blocker we should
> bump the mpool_sm_min_size default back up again so that 1.3.2 is no
> worse than 1.3.1.
> I put back SVN r20944 with this change. osu_bw now runs (for me).
> I filed CMR 1870 to add this change to the 1.3.2 branch. I guess I
> need a code review. Could someone review the code for r20944 and
> annotate the CMR? It's a one-line/several-character change that
> bumps the min default size from 0 to 64M.
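> To be concrete about what that looks like -- this is only a sketch of
> the shape of the change, not the literal r20944 diff, and it assumes
> the parameter is registered during component open via
> mca_base_param_reg_int() (the real code may use a local helper or
> register it differently):
>
>   /* Sketch only: the substantive change is just the compiled-in
>    * default, 0 -> 64 MB (67108864 bytes). */
>   int value;
>   mca_base_param_reg_int(&mca_mpool_sm_component.super.mpool_version,
>                          "min_size",
>                          "Minimum size (bytes) of the shared-memory mmap file",
>                          false, false,
>                          64 * 1024 * 1024,   /* was 0 */
>                          &value);
>
> Either way, anyone hitting this on 1.3.2 should be able to get the old
> headroom back at run time with something like
> "mpirun -mca mpool_sm_min_size 67108864 ...".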
> At this point, I assume 1857 is no longer a blocker, but in the long
> term the PML should be fixed.
> Lenny Verkhovsky wrote:
>> Changing the default value is an easy fix. It will not add new
>> possible bugs/deadlocks/paths where no one has gone before at the
>> PML level.
>> This fix can be added to Open MPI 1.3, which is currently blocked due
>> to the OSU failure.
>> The PML fix can be done later (IMHO).
>> On Sat, Apr 4, 2009 at 1:46 AM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:
>> What's next on this ticket? It's supposed to be a blocker. Again,
>> the issue is that osu_bw deluges a receiver with rendezvous
>> messages, but the receiver does not have enough eager frags to
>> acknowledge them all. We see this now that the sizing of the mmap
>> file has changed and there's less headroom to grow the free lists.
>> Possible fixes are:
>> A) Just make the default size of the mmap file larger (though with
>> less overkill than we used to have).
>> B) Fix the PML code that is supposed to deal with cases like this.
>> (At least I think the PML has code that's intended for this purpose.)
>> Eugene Loh wrote:
>> In osu_bw, process 0 pumps lots of Isend's to process 1, and
>> process 1 in turn sets up lots of matching Irecvs. Many messages
>> are in flight. The question is what happens when resources are
>> exhausted and OMPI cannot handle so much in-flight traffic. Let's
>> specifically consider the case of long, rendezvous messages. There
>> are at least two situations.
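>> (For concreteness, the traffic pattern boils down to roughly the
>> following. This is a stripped-down sketch, not the actual osu_bw
>> source; WINDOW and MSGSIZE are illustrative stand-ins for the
>> benchmark's window size and message size.)
>>
>>   #include <mpi.h>
>>   #include <stdlib.h>
>>
>>   #define WINDOW  64          /* in-flight messages (illustrative)   */
>>   #define MSGSIZE (1 << 20)   /* 1 MB: large enough to go rendezvous */
>>
>>   int main(int argc, char **argv)
>>   {
>>       int rank, i;
>>       MPI_Request req[WINDOW];
>>       char *buf;
>>
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       buf = malloc((size_t)WINDOW * MSGSIZE);
>>
>>       if (rank == 0) {                  /* pump lots of Isends */
>>           for (i = 0; i < WINDOW; i++)
>>               MPI_Isend(buf + (size_t)i * MSGSIZE, MSGSIZE, MPI_CHAR,
>>                         1, 0, MPI_COMM_WORLD, &req[i]);
>>           MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>>       } else if (rank == 1) {           /* post lots of matching Irecvs */
>>           for (i = 0; i < WINDOW; i++)
>>               MPI_Irecv(buf + (size_t)i * MSGSIZE, MSGSIZE, MPI_CHAR,
>>                         0, 0, MPI_COMM_WORLD, &req[i]);
>>           MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
>>       }
>>
>>       free(buf);
>>       MPI_Finalize();
>>       return 0;
>>   }
>>
>> Back to the two situations: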
>> 1) When the sender no longer has any fragments (nor can grow its
>> free list any more), it queues a send up with
>> add_request_to_send_pending() and somehow life is good. The PML
>> seems to handle this case "correctly".
>> 2) When the receiver -- specifically
>> mca_pml_ob1_recv_request_ack_send_btl() -- no longer has any
>> fragments to send ACKs back to confirm readiness for rendezvous,
>> the resource-exhaustion signal travels up the call stack to
>> mca_pml_ob1_recv_request_ack_send(), who does a
>> MCA_PML_OB1_ADD_ACK_TO_PENDING(). In short, the PML adds the ACK
>> to pckt_pending. Somehow, this code path doesn't work.
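>> (To make the intended mechanism concrete, here is a self-contained
>> toy model of the pattern both paths are supposed to follow: park the
>> work on a pending list when no fragment is available, and retry it
>> when a fragment comes back. This is not the ob1 code -- the names are
>> invented stand-ins for add_request_to_send_pending() /
>> MCA_PML_OB1_ADD_ACK_TO_PENDING() and for draining pckt_pending.
>> Situation 1 appears to do both halves correctly; in situation 2 the
>> "retry" half apparently never fires.)
>>
>>   #include <stdio.h>
>>
>>   #define NFRAGS  4   /* tiny, fixed "eager free list" that cannot grow */
>>   #define NMSGS  10   /* rendezvous messages deluging the receiver      */
>>
>>   static int free_frags = NFRAGS;
>>
>>   /* stand-in for ob1's pckt_pending: a FIFO of ACKs we could not send */
>>   static int pending[NMSGS], pend_head = 0, pend_tail = 0;
>>
>>   /* Try to send a rendezvous ACK; returns -1 if no fragment is
>>    * available (resource exhaustion, like ..._ack_send_btl() failing). */
>>   static int ack_send(int msg)
>>   {
>>       if (free_frags == 0)
>>           return -1;
>>       free_frags--;
>>       printf("ACK sent for message %d\n", msg);
>>       return 0;
>>   }
>>
>>   /* A fragment came back to the free list.  Crucially, this is also
>>    * where parked ACKs must be retried -- the step that, per situation
>>    * 2, does not seem to happen in the real code path. */
>>   static void frag_returned(void)
>>   {
>>       free_frags++;
>>       while (pend_head < pend_tail && free_frags > 0)
>>           ack_send(pending[pend_head++]);
>>   }
>>
>>   int main(void)
>>   {
>>       int msg;
>>
>>       for (msg = 0; msg < NMSGS; msg++) {
>>           if (ack_send(msg) != 0) {
>>               pending[pend_tail++] = msg;  /* like ..._ADD_ACK_TO_PENDING */
>>               printf("message %d: ACK parked on pending list\n", msg);
>>           }
>>       }
>>
>>       /* Fragments eventually drain back; every parked ACK must go out
>>        * now, or the sender waits forever on its rendezvous. */
>>       for (msg = 0; msg < NMSGS; msg++)
>>           frag_returned();
>>
>>       return 0;
>>   }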
>> The reason we see the problem now is that I added "autosizing" of
>> the shared-memory area. We used to mmap *WAY* too much shared-
>> memory for small-np jobs. (Yes, that's a subjective statement.)
>> Meanwhile, at large-np, we didn't mmap enough and jobs wouldn't
>> start. (Objective statement there.) So, I added heuristics to
>> size the shared area "appropriately". The heuristics basically
>> targeted the needs of MPI_Init(). If you want fragment free lists
>> to grow on demand after MPI_Init(), you now basically have to bump
>> mpool_sm_min_size up explicitly.
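>> (In case it helps the discussion, here is the shape of the sizing
>> heuristic I mean, as a hedged sketch -- the constants and the exact
>> formula in the tree differ, but the idea is that the result tracks
>> only the per-process needs of MPI_Init() scaled by np, clamped by the
>> mpool_sm_min_size / mpool_sm_max_size parameters.)
>>
>>   #include <stdio.h>
>>   #include <stddef.h>
>>
>>   /* Illustrative only: not the actual constants or formula. */
>>   static size_t sm_file_size(int np, size_t per_proc_init_need,
>>                              size_t min_size, size_t max_size)
>>   {
>>       size_t sz = (size_t)np * per_proc_init_need;
>>
>>       if (sz < min_size) sz = min_size;  /* floor: headroom for free lists */
>>       if (sz > max_size) sz = max_size;  /* ceiling: don't mmap too much   */
>>       return sz;
>>   }
>>
>>   int main(void)
>>   {
>>       /* e.g. with a (made-up) 1 MB per-process MPI_Init footprint: */
>>       printf("np=2:   %zu bytes\n",
>>              sm_file_size(2,   1 << 20, 0, (size_t)512 << 20));
>>       printf("np=256: %zu bytes\n",
>>              sm_file_size(256, 1 << 20, 0, (size_t)512 << 20));
>>       return 0;
>>   }
>>
>> With a min_size floor of 0, the small-np case gets almost no slack
>> beyond what MPI_Init() itself consumes, which is exactly why the free
>> lists can no longer grow on demand.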
>> I'd like feedback on a fix. Here are two options:
>> A) Someone (could be I) increases the default resources. E.g., we
>> could start with a larger eager free list. Or, I could change
>> those "heuristics" to allow some amount of headroom for free lists
>> to grow on demand. Either way, I'd appreciate feedback on how big
>> to set these things.
>> B) Someone (not I, since I don't know how) fixes the ob1 PML to
>> handle scenario 2 above correctly.