Terry Dontje wrote:
> Did your affinity script bind the processes per socket or linearly to
> cores? If the former, you'll want to look at using rankfiles and place
> the ranks based on sockets. We have found this especially useful if
> you are not running fully subscribed on your machines.
The script binds them to sockets and also binds memory per node.
It is smart enough that if the machine_file does not use all
the cores (because the user reordered them) then the script will
lay out the tasks evenly between the two sockets.
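
For anyone following along, the even layout described above can be sketched roughly as below. This is not the actual affinity script, which is not shown in this thread. The function names `socket_layout` and `rankfile_lines`, the two-socket, four-cores-per-socket node shape, and the host name are all assumptions made for illustration. The emitted lines use the Open MPI rankfile form `rank N=host slot=socket:core` that Terry refers to; whether the socket and core numbers match your hardware numbering should be checked on the machine.

```python
def socket_layout(ntasks, sockets=2, cores_per_socket=4):
    """Return a (socket, core) pair per local task, alternating sockets.

    Assumed node shape: `sockets` sockets with `cores_per_socket` cores each.
    """
    if ntasks > sockets * cores_per_socket:
        raise ValueError("more tasks than cores on the node")
    layout = []
    for task in range(ntasks):
        socket = task % sockets   # round-robin over the sockets
        core = task // sockets    # next unused core index on that socket
        layout.append((socket, core))
    return layout


def rankfile_lines(host, ntasks, first_rank=0, **node_shape):
    """Format the layout as rankfile lines for one (hypothetical) host."""
    return [
        f"rank {first_rank + i}={host} slot={socket}:{core}"
        for i, (socket, core) in enumerate(socket_layout(ntasks, **node_shape))
    ]


if __name__ == "__main__":
    # Four tasks on an eight-core node: two tasks land on each socket.
    for line in rankfile_lines("n001", 4):
        print(line)
```

With an under-subscribed node the tasks alternate between the sockets, so neither socket ends up with more than one task above the other.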
> Also, if you think the main issue is collectives performance you may
> want to try using the hierarchical and SM collectives. However, be
> forewarned we are right now trying to pound out some errors with these
> modules. To enable them you add the following parameters "--mca
> coll_hierarch_priority 100 --mca coll_sm_priority 100". We would be
> very interested in any results you get (failures, improvements,
I don't know why it is slow. OpenMPI is so flexible in how the
stack can be tuned. But I also have 100s of users running dozens
of major codes, and what I need is a set of options that 'just work'
in most cases.
I will try the above options and get back to you.
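
As a note for anyone repeating the test, here is a small sketch of how the options mentioned in this thread could be assembled into one command line. The MCA parameter names and values are the ones quoted above; the helper `mpirun_command`, the process count, and the executable name `./a.out` are placeholders I made up for illustration.

```python
import shlex


def mpirun_command(nprocs, executable, mca=None):
    """Build an mpirun argument list with one '--mca name value' per option."""
    cmd = ["mpirun", "-np", str(nprocs)]
    for name, value in (mca or {}).items():
        cmd += ["--mca", name, str(value)]
    cmd.append(executable)
    return cmd


# Options taken from this thread; everything else is a placeholder.
options = {
    "mpi_paffinity_alone": 1,
    "coll_hierarch_priority": 100,
    "coll_sm_priority": 100,
}

if __name__ == "__main__":
    print(shlex.join(mpirun_command(512, "./a.out", options)))
```

Keeping the options in one dictionary makes it easy to run the same job with and without the collective modules and compare the results.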
>> Message: 4
>> Date: Thu, 06 Aug 2009 17:03:08 -0600
>> From: Craig Tierney <Craig.Tierney_at_[hidden]>
>> Subject: Re: [OMPI users] Performance question about OpenMPI and
>> MVAPICH2 on IB
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <4A7B612C.8070501_at_[hidden]>
>> Content-Type: text/plain; charset=ISO-8859-1
>> A followup....
>> Part of the problem was affinity. I had written a script to do processor
>> and memory affinity (which works fine with MVAPICH2). It is an
>> idea that I got from TACC. However, the script didn't seem to
>> work correctly with OpenMPI (or I still have bugs).
>> Setting --mca mpi_paffinity_alone 1 made things better. However,
>> the performance is still not as good:
>> Cores   Mvapich2   Openmpi
>>     8       17.3      17.3
>>    16       31.7      31.5
>>    32       62.9      62.8
>>    64      110.8     108.0
>>   128      219.2     201.4
>>   256      384.5     342.7
>>   512      687.2     537.6
>> The performance number is GFlops (so larger is better).
>> The first few numbers show that the executable is the right
>> speed. I verified that IB is being used by using OMB and
>> checking latency and bandwidth. Those numbers are what I
>> expect (3GB/s, 1.5 us for QDR).
>> However, the Openmpi version is not scaling as well. Any
>> ideas on why that might be the case?
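
To put numbers on the scaling gap, the table above can be reduced to two derived quantities: the Open MPI result as a fraction of the MVAPICH2 result, and each library's parallel efficiency relative to its own 8-core run. The sketch below uses only the GFlops figures posted above; taking the 8-core row as the baseline and the name `scaling_summary` are my own choices.

```python
# GFlops from the table in this thread: cores -> (MVAPICH2, Open MPI)
results = {
    8: (17.3, 17.3),
    16: (31.7, 31.5),
    32: (62.9, 62.8),
    64: (110.8, 108.0),
    128: (219.2, 201.4),
    256: (384.5, 342.7),
    512: (687.2, 537.6),
}


def scaling_summary(results, base_cores=8):
    """Return rows of (cores, ompi/mvapich2, mvapich2 eff., ompi eff.).

    Efficiency is measured GFlops divided by ideal linear scaling
    from the `base_cores` run of the same library.
    """
    base_mv, base_om = results[base_cores]
    rows = []
    for cores in sorted(results):
        mv, om = results[cores]
        scale = cores / base_cores
        rows.append((cores, om / mv, mv / (base_mv * scale), om / (base_om * scale)))
    return rows


if __name__ == "__main__":
    print(f"{'cores':>5} {'ompi/mv2':>9} {'mv2 eff':>8} {'ompi eff':>9}")
    for cores, ratio, eff_mv, eff_om in scaling_summary(results):
        print(f"{cores:>5} {ratio:>9.3f} {eff_mv:>8.3f} {eff_om:>9.3f}")
```

At 512 cores this works out to Open MPI delivering roughly 78% of the MVAPICH2 throughput, with parallel efficiency of about 62% for MVAPICH2 against about 49% for Open MPI, while the two are essentially identical up to 32 cores.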