I agree with Bill that performance portability is an issue.  That is, the MPI standard itself doesn't really provide any guarantees here about what is fastest.  Perhaps polling this mailing list will be helpful, but if you are looking for "the fastest" solution regardless of which MPI implementation you use (and which interconnect you use... which might be determined at run time) you will probably be disappointed.
 
 
I was afraid that was going to be the case :-( So if I'm concerned about being network-bandwidth-bound and I want good performance on different types of architectures, should I design my application so that it can use any of the above communication patterns, and then tell the app at run time which pattern to use depending on the middleware/hardware it will run on (to optimise bandwidth usage)?
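
A minimal sketch of what that kind of run-time selection could look like in C. Everything named here is a placeholder: the pattern names, the APP_COMM_PATTERN environment variable, and the stub functions, which only stand in for the real MPI calls of each pattern and return their own id so the dispatch can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pattern ids; the real candidates are whichever
   communication patterns the application already implements. */
typedef enum { PATTERN_A = 0, PATTERN_B, PATTERN_C, PATTERN_COUNT } comm_pattern;

/* Common signature shared by every pattern implementation. */
typedef int (*exchange_fn)(void *buf, size_t nbytes);

/* Stubs: each one would wrap the actual MPI calls for that pattern.
   Here they only return their own id. */
static int exchange_a(void *buf, size_t nbytes) { (void)buf; (void)nbytes; return PATTERN_A; }
static int exchange_b(void *buf, size_t nbytes) { (void)buf; (void)nbytes; return PATTERN_B; }
static int exchange_c(void *buf, size_t nbytes) { (void)buf; (void)nbytes; return PATTERN_C; }

/* Table mapping a user-visible name to an implementation. */
static const struct { const char *name; exchange_fn fn; } patterns[PATTERN_COUNT] = {
    { "pattern_a", exchange_a },
    { "pattern_b", exchange_b },
    { "pattern_c", exchange_c },
};

/* Map a name to a pattern id; NULL or unknown names give the fallback. */
comm_pattern pattern_from_name(const char *name, comm_pattern fallback)
{
    if (name != NULL) {
        for (int i = 0; i < PATTERN_COUNT; ++i) {
            if (strcmp(name, patterns[i].name) == 0)
                return (comm_pattern)i;
        }
    }
    return fallback;
}

/* Read the choice from the (assumed) APP_COMM_PATTERN variable. */
comm_pattern pattern_from_env(void)
{
    return pattern_from_name(getenv("APP_COMM_PATTERN"), PATTERN_A);
}

/* Dispatch one exchange through the selected pattern. */
int do_exchange(comm_pattern p, void *buf, size_t nbytes)
{
    assert(p >= 0 && p < PATTERN_COUNT);
    return patterns[p].fn(buf, nbytes);
}
```

The application would call pattern_from_env() once at start-up and pass the result to do_exchange() for every exchange, so switching patterns for a given machine or MPI implementation needs no recompilation. Which pattern is actually fastest on a given system would still have to be measured there.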
 
toon