Changing the default value is an easy fix. It will not introduce new bugs,
deadlocks, or code paths that no one has exercised before at the PML level.
It can go into Open MPI 1.3, which is currently blocked by this OSU issue.
The PML fix can be done later (IMHO).
On Sat, Apr 4, 2009 at 1:46 AM, Eugene Loh <Eugene.Loh_at_[hidden]> wrote:
> What's next on this ticket? It's supposed to be a blocker. Again, the
> issue is that osu_bw deluges a receiver with rendezvous messages, but the
> receiver does not have enough eager frags to acknowledge them all. We see
> this now that the sizing of the mmap file has changed and there's less
> headroom to grow the free lists. Possible fixes are:
> A) Just make the mmap file default size larger (though less overkill than
> we used to have).
> B) Fix the PML code that is supposed to deal with cases like this. (At
> least I think the PML has code that's intended for this purpose.)
> Eugene Loh wrote:
>> In osu_bw, process 0 pumps lots of Isends to process 1, and process 1 in
>> turn sets up lots of matching Irecvs. Many messages are in flight. The
>> question is what happens when resources are exhausted and OMPI cannot handle
>> so much in-flight traffic. Let's specifically consider the case of long,
>> rendezvous messages. There are at least two situations.
>> 1) When the sender no longer has any fragments (nor can grow its free list
>> any more), it queues a send up with add_request_to_send_pending() and
>> somehow life is good. The PML seems to handle this case "correctly".
>> 2) When the receiver -- specifically
>> mca_pml_ob1_recv_request_ack_send_btl() -- no longer has any fragments to
>> send ACKs back to confirm readiness for rendezvous, the resource-exhaustion
>> signal travels up the call stack to mca_pml_ob1_recv_request_ack_send(), which
>> invokes MCA_PML_OB1_ADD_ACK_TO_PENDING(). In short, the PML adds the ACK to
>> pckt_pending. Somehow, this code path doesn't work.
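To make scenario 2 concrete, here is a toy model of a receiver that defers ACKs when it has no free fragments and retries them when a fragment is returned. This is only a sketch of the intended behavior, not Open MPI source and not a reproduction of the bug: every name in it (receiver_t, receiver_ack, receiver_frag_returned, simulate) is hypothetical, and the real ob1 retry logic may work differently.

```c
/* Toy model only: NOT Open MPI code. All names are hypothetical. */
#include <assert.h>

#define MAX_PENDING 64

typedef struct {
    int free_frags;            /* eager fragments currently available   */
    int pending[MAX_PENDING];  /* FIFO of message ids with deferred ACK */
    int head, count;           /* FIFO read position and length         */
    int acks_sent;             /* total ACKs actually sent              */
} receiver_t;

static void receiver_init(receiver_t *r, int frags) {
    r->free_frags = frags;
    r->head = 0;
    r->count = 0;
    r->acks_sent = 0;
}

/* Try to ACK a rendezvous request. Returns 1 if sent, 0 if deferred. */
static int receiver_ack(receiver_t *r, int msg_id) {
    /* Send only if a fragment is free and nothing older is waiting. */
    if (r->free_frags > 0 && r->count == 0) {
        r->free_frags--;
        r->acks_sent++;
        return 1;
    }
    assert(r->count < MAX_PENDING);
    r->pending[(r->head + r->count) % MAX_PENDING] = msg_id;
    r->count++;
    return 0;
}

/* A fragment comes back: reuse it for the oldest deferred ACK, if any. */
static void receiver_frag_returned(receiver_t *r) {
    r->free_frags++;
    while (r->count > 0 && r->free_frags > 0) {
        r->head = (r->head + 1) % MAX_PENDING;
        r->count--;
        r->free_frags--;
        r->acks_sent++;
    }
}

/* Deluge the receiver with `msgs` requests, then complete in-flight ACKs
 * one at a time, recycling each fragment. Returns the ACKs sent. */
static int simulate(int frags, int msgs) {
    receiver_t r;
    int in_flight = 0;
    receiver_init(&r, frags);
    for (int id = 0; id < msgs; id++)
        in_flight += receiver_ack(&r, id);
    while (in_flight > 0) {
        int before = r.acks_sent;
        in_flight--;                  /* one ACK completes ...          */
        receiver_frag_returned(&r);   /* ... and its fragment is reused */
        in_flight += r.acks_sent - before;
    }
    return r.acks_sent;
}
```

In this model, simulate(4, 10) eventually sends all 10 ACKs, because each completed ACK returns a fragment and triggers a retry. With simulate(0, 3) nothing is ever sent: no fragment is in flight, so nothing ever comes back to drive the pending list. The point of the sketch is only that deferred ACKs make progress when something retries them.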
>> The reason we see the problem now is that I added "autosizing" of the
>> shared-memory area. We used to mmap *WAY* too much shared memory for
>> small-np jobs. (Yes, that's a subjective statement.) Meanwhile, at
>> large-np, we didn't mmap enough and jobs wouldn't start. (Objective
>> statement there.) So, I added heuristics to size the shared area
>> "appropriately". The heuristics basically targeted the needs of
>> MPI_Init(). If you want fragment free lists to grow on demand after
>> MPI_Init(), you now basically have to bump mpool_sm_min_size up explicitly.
>> I'd like feedback on a fix. Here are two options:
>> A) Someone (could be I) increases the default resources. E.g., we could
>> start with a larger eager free list. Or, I could change those "heuristics"
>> to allow some amount of headroom for free lists to grow on demand. Either
>> way, I'd appreciate feedback on how big to set these things.
>> B) Someone (not I, since I don't know how) fixes the ob1 PML to handle
>> scenario 2 above correctly.
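The headroom idea in option A could look roughly like the sketch below. The actual sizing heuristics are not shown in this thread, so this is an assumption-laden illustration: the function name sm_file_size and all of its parameters are hypothetical, and the "right" amount of headroom is exactly the open question raised above.

```c
/* Hypothetical sketch: NOT the real sm mpool sizing code. */
#include <assert.h>
#include <stddef.h>

/* Size the shared-memory file as what MPI_Init() needs (init_bytes) plus
 * room for `extra_frags` additional fragments per process, and never
 * return less than an explicit floor (`min_size`). */
static size_t sm_file_size(size_t init_bytes, int nprocs,
                           size_t frag_bytes, size_t extra_frags,
                           size_t min_size) {
    assert(nprocs > 0);
    size_t headroom = (size_t)nprocs * extra_frags * frag_bytes;
    size_t size = init_bytes + headroom;
    return size < min_size ? min_size : size;
}
```

With extra_frags set to 0 this reduces to "whatever MPI_Init() needs, subject to the floor", which matches the current situation where free lists can only grow if the minimum size is raised explicitly.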