I don't think this is a bug in OMPI; the difference more likely reflects improvements in the default collective algorithms. If you want to improve performance further, you should bind your processes to a core (if your application isn't threaded) or to a socket (if it is threaded).
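One way to check whether a process is actually bound is to look at the CPU affinity mask the operating system reports for it. The following is only a minimal sketch, not part of Open MPI: it assumes a Linux host (where Python's os.sched_getaffinity is available), and the helper name describe_affinity is made up for illustration. A process bound to a single core would typically report one CPU (or two hardware threads when HyperThreading is enabled), while an unbound process typically reports every CPU on the node.

```python
import os


def describe_affinity(pid=0):
    """Return (allowed_cpus, total_cpus) for a process; pid 0 means the caller.

    Hypothetical helper for illustration; relies on Linux-only
    os.sched_getaffinity.
    """
    if not hasattr(os, "sched_getaffinity"):
        raise OSError("os.sched_getaffinity is not available on this platform")
    allowed = sorted(os.sched_getaffinity(pid))  # CPUs this process may run on
    total = os.cpu_count()                       # logical CPUs seen by the OS
    return allowed, total


allowed, total = describe_affinity()
print(f"pid {os.getpid()}: allowed on {len(allowed)} of {total} CPUs: {allowed}")
```

Running something like this once per rank (for example from a small wrapper script) would show whether the binding took effect on every node.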

As someone previously noted, apps will always run slower across multiple nodes than with everything on a single node, because ranks on the same node communicate through shared memory while ranks on different nodes have to go over IB. Nothing you can do about that one.


On Oct 28, 2013, at 10:36 PM, San B <forum.san@gmail.com> wrote:

      As discussed earlier, the executable compiled with OpenMPI-1.4.5 performed very poorly, taking 12338.809 seconds when the job ran on two nodes (8 cores per node). The same job on a single node (all 16 cores) finished in just 3692.403 seconds. I have now compiled the application with OpenMPI-1.6.5, and it ran in 5527.320 seconds on two nodes.

     Is this a performance gain with OMPI-1.6.5 over OMPI-1.4.5, or an issue with OpenMPI itself?
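To put the three timings side by side, the ratios can be worked out as in the sketch below. The numbers are the wall-clock times quoted in this message; the variable names are only illustrative, and the ratios say nothing by themselves about where the time goes.

```python
# Wall-clock times in seconds, as reported in the message above.
single_node = 3692.403       # 16 processes on one node
two_nodes_145 = 12338.809    # 2 nodes x 8 processes, OpenMPI-1.4.5 build
two_nodes_165 = 5527.320     # 2 nodes x 8 processes, OpenMPI-1.6.5 build

# How much slower each two-node run is than the single-node run.
slowdown_145 = two_nodes_145 / single_node
slowdown_165 = two_nodes_165 / single_node

# How much faster the 1.6.5 build is than the 1.4.5 build on two nodes.
gain_165_over_145 = two_nodes_145 / two_nodes_165

print(f"two nodes, 1.4.5 vs single node: {slowdown_145:.2f}x slower")
print(f"two nodes, 1.6.5 vs single node: {slowdown_165:.2f}x slower")
print(f"1.6.5 vs 1.4.5 on two nodes:     {gain_165_over_145:.2f}x faster")
```

By these figures the 1.6.5 build is a bit more than twice as fast as the 1.4.5 build on two nodes, but still about 1.5 times slower than the single-node run.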


On Tue, Oct 15, 2013 at 5:32 PM, San B <forum.san@gmail.com> wrote:
Hi,

     As you suggested, I profiled the application with mpiP. The two runs differ as follows:

Run 1: 16 MPI processes on a single node

@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0   3.61e+03        661    18.32
   1   3.61e+03        627    17.37
   2   3.61e+03        700    19.39
   3   3.61e+03        665    18.41
   4   3.61e+03        702    19.45
   5   3.61e+03        703    19.48
   6   3.61e+03        740    20.50
   7   3.61e+03        763    21.14
...
...

Run 2: 16 MPI processes on two nodes (8 MPI processes per node)

@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0   1.27e+04   1.06e+04    84.14
   1   1.27e+04   1.07e+04    84.34
   2   1.27e+04   1.07e+04    84.20
   3   1.27e+04   1.07e+04    84.20
   4   1.27e+04   1.07e+04    84.22
   5   1.27e+04   1.07e+04    84.25
   6   1.27e+04   1.06e+04    84.02
   7   1.27e+04   1.07e+04    84.35
   8   1.27e+04   1.07e+04    84.29


          The time spent in MPI functions is roughly 17-21% of the application time in run 1, whereas it is more than 80% in run 2. For more details, I've attached both output files. Please go through these files and suggest what optimization we can do with OpenMPI or Intel MKL.
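The same tables can also be read in absolute terms by subtracting MPITime from AppTime for each task. The sketch below does that for the tasks listed above only (the listings are truncated, and mpiP prints the values to three significant figures, so the results are approximate). The function name summarize is illustrative and not part of mpiP.

```python
# (AppTime, MPITime) in seconds for the tasks shown in the mpiP excerpts above.
run1 = [(3.61e3, 661), (3.61e3, 627), (3.61e3, 700), (3.61e3, 665),
        (3.61e3, 702), (3.61e3, 703), (3.61e3, 740), (3.61e3, 763)]
run2 = [(1.27e4, 1.06e4), (1.27e4, 1.07e4), (1.27e4, 1.07e4),
        (1.27e4, 1.07e4), (1.27e4, 1.07e4), (1.27e4, 1.07e4),
        (1.27e4, 1.06e4), (1.27e4, 1.07e4), (1.27e4, 1.07e4)]


def summarize(rows):
    """Return mean application, MPI and non-MPI time over the given tasks."""
    n = len(rows)
    app = sum(a for a, _ in rows) / n
    mpi = sum(m for _, m in rows) / n
    return {"app": app, "mpi": mpi, "non_mpi": app - mpi}


s1 = summarize(run1)
s2 = summarize(run2)

for name, s in (("run 1", s1), ("run 2", s2)):
    print(f"{name}: app {s['app']:.0f} s, MPI {s['mpi']:.0f} s, "
          f"non-MPI {s['non_mpi']:.0f} s")

print(f"growth in app time: {s2['app'] - s1['app']:.0f} s")
print(f"growth in MPI time: {s2['mpi'] - s1['mpi']:.0f} s")
```

For the tasks shown, the non-MPI time does not increase from run 1 to run 2, so the whole increase in run time shows up as time spent inside MPI calls.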

Thanks


On Mon, Oct 7, 2013 at 12:15 PM, San B <forum.san@gmail.com> wrote:

Hi,

I'm facing a performance issue with a scientific application (Fortran). It runs fast on a single node but very slowly on multiple nodes. For example, a 16-core job on a single node finishes in 1 hr 2 min, but the same job on two nodes (i.e. 8 cores per node, with the remaining 8 cores kept free) takes 3 hr 20 min. The code is compiled with ifort-13.1.1, openmpi-1.4.5 and the Intel MKL libraries: lapack, blas, scalapack, blacs & fftw. What could be the problem here?

Is it possible to do any tuning in OpenMPI? FYI, more info: the cluster has Intel Sandy Bridge processors (E5-2670) and InfiniBand, and HyperThreading is enabled. Jobs are submitted through the LSF scheduler.

Could HyperThreading be causing a problem here?


Thanks


_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
