If you are running fewer processes on your nodes than they have
processors, then you can improve performance by adding
-mca mpi_paffinity_alone 1
to your mpirun command line. This will bind your processes to individual
cores, which helps with latency. If your program involves collectives, then
you can try setting
-mca coll_hierarch_priority 100
This will activate the hierarchical collectives, which use shared
memory for messages between processes on the same node.
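Both parameters can go on the same mpirun line; a sketch based on the invocation in the quoted message (the <num>, <my-hosts>, and <my-program> placeholders are from the original command, not literal values):

```shell
# Bind each process to its own core and raise the priority of the
# hierarchical collective component so it is selected at runtime.
mpirun -np <num> --hostfile <my-hosts> \
    -mca mpi_paffinity_alone 1 \
    -mca coll_hierarch_priority 100 \
    <my-program>
```

The same values can also be set in $HOME/.openmpi/mca-params.conf if you prefer not to repeat them on every command line.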
On Jun 26, 2009, at 9:09 PM, Qiming He wrote:
> Hi all,
> I am new to OpenMPI, and have an urgent run-time question. I have
> openmpi-1.3.2 compiled with Intel Fortran compiler v.11 simply by
> ./configure --prefix=<my-dir> F77=ifort FC=ifort
> then I set my LD_LIBRARY_PATH to include <openmpi-lib> and <intel-lib>
> and compile my Fortran program properly. No compilation error.
> I run my program on a single node and everything looks ok. However, when
> I run it on multiple nodes:
> mpirun -np <num> --hostfile <my-hosts> <my-program>
> The performance is much worse than on a single node with the same
> problem size (MPICH2 shows a 50% improvement).
> I use top and saidar and find that user time (CPU user) is much lower
> than system time (CPU system), i.e.,
> only a small portion of CPU time is used by the user application, while
> the rest is spent in the system.
> No wonder I got bad performance. I am assuming "CPU system" is used
> for MPI communication.
> I notice the total traffic (on eth0) is not that big (~5Mb/sec).
> What is the system CPU busy doing?
> Can anyone help? Anything I need to tune?
> Thanks in advance