I recently switched to Open MPI (v1.1.1) from LAM/MPI.  My application now runs at roughly one quarter of the speed of the same program under LAM.  Let me explain my setup.

The program runs as 16 processes on 8 dual-processor Apple Xserve nodes, each with one gigabit NIC connected to a gigabit switch.  The application communicates once per 1 ms of model time (under LAM the program ran slightly faster than realtime).  At each communication step, every process needs information from each of the other processes.  The amount transmitted from any given process varies from one int (4 bytes) to about 1200-1500 bytes (just under one normal ethernet frame).  Jumbo frames are not supported by the switch.  The small case of 4-50 bytes is the common one, occurring more than 80% of the time.

The communication scheme I devised to compress the traffic is this.  On each node, the higher-ranked process first transfers its data to the lower-ranked process via shared memory.  Then the lower-ranked processes from each node communicate in a tree'd round-robin scheme (to avoid contention for resources [the NIC] and to minimise traffic); see the pseudocode below.  Finally, the lower-ranked process on each node passes the merged result back to the higher-ranked process via shared memory.  Under both LAM and Open MPI the processes are distributed --byslot.  And yes, this scheme was ~3x faster than MPI_Alltoallv or MPI_Allgatherv under LAM.  One more point: the transfers were partitioned into packets of 1500 bytes at each stage and padded if necessary.

Pseudocode for the tree'd round-robin scheme:

// share on node first (via shared memory)
if (mpi_rank % 2 == 0) {
    MPI_Recv();                       // from the higher-ranked on-node process
    merge_current_info_with_new_info();
} else {
    MPI_Send();                       // to the lower-ranked on-node process
}
// share between nodes (lower-ranked processes only; bit 0 is on-node)
for (i = 1; i < ceil(log2(mpi_size)); i++) {
    share_partner = mpi_rank ^ (1 << i);
    if (share_partner < mpi_size) {   // does partner exist?
        MPI_Isend();
        MPI_Irecv();
        MPI_Waitall();
        merge_current_info_with_new_info();
    }
}
// share on node afterward
if (mpi_rank % 2 == 0) {
    MPI_Send();                       // merged result back to the higher rank
} else {
    MPI_Recv();
}


I know this is a detailed email, but it is important that I resolve this (the faster the model runs, the faster I graduate).  One more interesting tidbit: under LAM this scheme scaled up to the full 8 nodes (with linear scaling up to 4 nodes).  Under Open MPI, performance is essentially flat beyond one node (two processes).

Thanks for any help!!!

Karl Dockendorf
Research Fellow
Department of Biomedical Engineering
University of Florida