George,

Thanks for the tips. It looks like using "-bynode" as opposed to "-byslot" is the best way to distribute processes when running Amber 9's Sander module. I confirmed that with MPICH-MX as well. I didn't realize that these settings were available. This really helps, because I was getting bummed that I would have to keep various hostfiles around, some with slots=XX and some with nothing but the hostname. A quick sketch of what I mean is below the timings.

Just an FYI on the timings:

-bynode:
real    0m35.035s

-byslot:
real    0m44.856s
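
For anyone curious, here is a sketch of what I mean (node names are just placeholders, not my actual hosts):

# hostfile -- one line per node, declaring 4 slots each
node01 slots=4
node02 slots=4
...

# -bynode spreads the 32 ranks round-robin across the nodes
mpirun --hostfile hostfile -np 32 -bynode <my_application>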


Warner Yuen

Scientific Computing Consultant


On Mar 29, 2007, at 9:00 AM, users-request@open-mpi.org wrote:

Message: 1
Date: Wed, 28 Mar 2007 12:19:15 -0400
From: George Bosilca <bosilca@cs.utk.edu>
Subject: Re: [OMPI users] Odd behavior with slots=4
To: Open MPI Users <users@open-mpi.org>

There are multiple possible answers here. One is related to oversubscription of your cluster, but I expect that there are at least 4 cores per node if you want to use the slots=4 option. The real question is: what is the communication pattern in this benchmark, and how does it match the distribution of the processes you use?

As a matter of fact, when you have XX processes per node and all of them try to send a message to a remote process (here "remote" means on another node), they have to share the physical Myrinet link, which of course leads to lower overall performance as XX increases (from 1 to 2 and then 4). And this is true regardless of how you use the MX driver (via the Open MPI MTL or BTL).

Open MPI provides two options that allow you to distribute the processes based on different criteria. Try -bynode and -byslot to see if this affects the overall performance.
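
To illustrate with a made-up example (8 processes across two nodes, each listed with slots=4), the mapping works out to:

  -byslot: node1 gets ranks 0-3,     node2 gets ranks 4-7
  -bynode: node1 gets ranks 0,2,4,6, node2 gets ranks 1,3,5,7

So -byslot keeps neighboring ranks on the same node, while -bynode spreads them round-robin across the nodes.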


   Thanks,

     george.


On Mar 28, 2007, at 9:56 AM, Warner Yuen wrote:


I'm seeing some curious performance when using Open MPI 1.2 to run Amber 9 on my Xserve Xeon 5100 cluster. Each cluster node is a dual-socket, dual-core system. The cluster is also running Myrinet 2000 with MX. I'm just running some tests with one of Amber's benchmarks.


It seems that my hostfiles affect the performance of the application. I tried variations of the hostfile to see what would happen (example hostfile lines are sketched after the timings below). I did a straight mpirun with no MCA options set, using "mpirun -np 32":


variation 1: hostname
real    0m35.391s

variation 2: hostname slots=4
real    0m45.698s

variation 3: hostname slots=2
real    0m38.761s
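
(For clarity, each variation is just one line per node; variation 2, for example, looks something like the following, with made-up node names. Variations 1 and 3 are the same list with the slots= part dropped or set to 2.)

node01 slots=4
node02 slots=4
...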



It seems that the best performance I achieve is when I use variation 1, with only the hostname, and execute the command "mpirun --hostfile hostfile -np 32 <my_application>". It's shockingly about 13% better performance than if I use the hostfile with a syntax of "hostname slots=4".


I also tried variations of my mpirun command; here are the times:


straight mpirun with no MCA options
real    0m45.698s

and....

"-mca mpi_yield_when_idle 0"
real    0m44.912s

and....

"-mca mtl mx -mca pml cm"
real    0m45.002s