As I understand it, when MPI_Iprobe is called, the code that runs is the function pointed to by the attribute
In the file ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c (Open MPI 1.4.3),
ompi_crcp_bkmrk_pml_iprobe calls drain_message_find_any.
In drain_message_find_any (also in ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c), there is a loop over all MPI ranks,
regardless of the peer parameter.
For instance, with 256 peers, probing for peer 255 requires 256 iterations, while probing for peer 0 requires only 1.
As I understand it, the linked list ompi_crcp_bkmrk_pml_peer_refs is populated with nprocs entries, where nprocs is presumably the number of MPI ranks in MPI_COMM_WORLD.
If my understanding is right, here are some suggestions:
1. ompi_crcp_bkmrk_pml_peer_refs should be an array so that, when peer is not MPI_ANY_SOURCE, MPI_Iprobe can return in constant time.
2. There should be some sort of round-robin mechanism for the case where peer is MPI_ANY_SOURCE; otherwise lower ranks will be probed more often and higher ranks will
suffer from starvation. This could be done by keeping a current position in the peer list (or array, see point 1). Instead of always starting the loop at the first entry, the loop would start at the current position and
perform at most nprocs iterations.
A code review is on my blog: http://dskernel.blogspot.com/2011/09/code-review-what-happens-in-open-mpis.html