I tried a couple of things, including your suggestion. I also found out this
has been reported before,
but there seems to be no clear solution so far.
Here is what I observed:
I kept the problem size fixed with 24 processes. I used two nodes with 8
cores each, and two nodes with 2 cores each.
1. When oversubscribed (12 processes/processor), system time relative to user
time is much higher than when less subscribed (1.5 processes/processor).
The wall clock time barely improves either :-(
2. I tried the following options, individually and together; no difference:
mpirun --mca mpi_yield_when_idle 1 --mca btl tcp,sm,self --mca
coll_hierarch_priority 100 ...
3. The older Open MPI version (1.3) seems to perform better than the newer
version (1.3.2), but not significantly.
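To put numbers on the user-vs-system split instead of eyeballing top/saidar, the "cpu" line of /proc/stat can be sampled directly. Here is a minimal sketch; the helper name is mine, and the jiffy counts in the example are made up for illustration:

```python
def cpu_fractions(stat_line):
    """Return (user_fraction, system_fraction) from a /proc/stat 'cpu' line.

    Fields after the 'cpu' label are jiffy counters in the order
    user, nice, system, idle, iowait, irq, softirq, ...
    """
    fields = [int(x) for x in stat_line.split()[1:]]
    user, system = fields[0], fields[2]
    total = sum(fields)
    return user / total, system / total

# Made-up snapshot where system time dominates user time, as observed here:
sample = "cpu 1200 0 5400 2400 0 0 0"
u, s = cpu_fractions(sample)
print(f"user={u:.2f} system={s:.2f}")  # → user=0.13 system=0.60
```

In practice one would read `/proc/stat` twice a few seconds apart and take the difference of the counters, since the values are cumulative since boot.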
By the way, I am working on Amazon EC2 (VM hosts). Will that make any
difference?
On Fri, Jun 26, 2009 at 11:28 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> If you are running fewer processes on your nodes than they have processors,
> then you can improve performance by adding
> -mca mpi_paffinity_alone 1
> to your cmd line. This will bind your processes to individual cores, which
> helps with latency. If your program involves collectives, then you can try
> -mca coll_hierarch_priority 100
> This will activate the hierarchical collectives, which utilize shared
> memory for messages between procs on the same node.
> On Jun 26, 2009, at 9:09 PM, Qiming He wrote:
>> Hi all,
>> I am new to OpenMPI, and have an urgent run-time question. I have
>> openmpi-1.3.2 compiled with Intel Fortran compiler v.11 simply by
>> ./configure --prefix=<my-dir> F77=ifort FC=ifort
>> then I set my LD_LIBRARY_PATH to include <openmpi-lib> and <intel-lib>
>> and compile my Fortran program properly. No compilation error.
>> I run my program on a single node and everything looks OK. However, when I
>> run it on multiple nodes:
>> mpirun -np <num> --hostfile <my-hosts> <my-program>
>> the performance is much worse than on a single node with the same problem
>> size (MPICH2 shows a 50% improvement).
>> I used top and saidar and found that user time (CPU user) is much lower
>> than system time (CPU system), i.e.,
>> only a small portion of CPU time is used by the user application, while
>> the rest is spent in the system.
>> No wonder I get bad performance. I am assuming "CPU system" is used for
>> MPI communication.
>> I notice the total traffic (on eth0) is not that big (~5 Mb/sec). What is
>> keeping the system CPU so busy?
>> Can anyone help? Anything I need to tune?
>> Thanks in advance
>> users mailing list