Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] unacceptable latency in gathering process
From: Iliev, Hristo (iliev_at_[hidden])
Date: 2012-10-04 05:01:56



I would suggest that (if you haven't done so already) you trace your
program's execution with Vampir or Scalasca. The latter has some pretty nice
analysis capabilities built in and can detect common patterns that would
keep your code from scaling, no matter how good the MPI library is. Open MPI
also has many knobs that you can tune via MCA parameters. Start with the
general tuning FAQ:


then move to the InfiniBand tuning FAQ:


Kind regards,


Hristo Iliev, Ph.D. -- High Performance Computing
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23,  D 52074  Aachen (Germany)
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Hodge, Gary C
Sent: Wednesday, October 03, 2012 6:41 PM
To: users_at_[hidden]
Subject: [OMPI users] unacceptable latency in gathering process
Hi all,
I am running on an IBM BladeCenter, using Open MPI 1.4.1 and the opensm subnet
manager for InfiniBand.
Our application has real-time requirements, and it has recently been proven
that it does not scale to meet future requirements.
Presently, I am re-organizing the application to process work in a more
parallel manner than it does now.
Jobs arrive at the rate of 200 per second and are sub-divided into groups of
objects by a master process (MP) on its own node.
The MP then assigns the object groups to 20 slave processes (SP), each
running on their own node, to do the expensive computational work in
The SPs then send their results to a gatherer process (GP) on its own node
that merges the results for the job and sends it onward for final
The highest latency for the last 1024 jobs that were processed is then
written to a log file that is displayed by a GUI.
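The rolling worst-case figure described above could be tracked with a fixed-size ring buffer. This is only a minimal sketch, not the poster's actual code: the names latency_window, window_add, and window_max are hypothetical, and only the window size of 1024 comes from the message.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 1024  /* number of most recent jobs tracked, as stated in the message */

/* Ring buffer holding the latencies (in ms) of the most recent jobs. */
typedef struct {
    double samples[WINDOW];
    size_t count;  /* number of valid samples, at most WINDOW */
    size_t next;   /* slot that the next sample will overwrite */
} latency_window;

static void window_init(latency_window *w) {
    assert(w != NULL);
    w->count = 0;
    w->next = 0;
}

/* Record one job latency, overwriting the oldest sample once the window is full. */
static void window_add(latency_window *w, double latency_ms) {
    assert(w != NULL);
    w->samples[w->next] = latency_ms;
    w->next = (w->next + 1) % WINDOW;
    if (w->count < WINDOW)
        w->count++;
}

/* Highest latency among the tracked jobs; returns 0.0 if nothing was recorded. */
static double window_max(const latency_window *w) {
    assert(w != NULL);
    double max = 0.0;
    for (size_t i = 0; i < w->count; i++)
        if (w->samples[i] > max)
            max = w->samples[i];
    return max;
}
```

A linear scan over 1024 doubles is cheap next to a 20 ms job budget, so the sketch does not bother with a smarter sliding-maximum structure.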
Each process uses the same controller method for sending and receiving
messages, as follows:
For (each CPU that sends us input)
While (true)
    For (each CPU that sends us input)
        If (message was received)
            Copy the message
            Queue the copy to our input queue
    If (there are messages on our input queue)
        ... process the FIRST message on queue (this may queue messages for output) ...
    For (each message on our output queue)
My problem is that I do not meet our application's performance requirement
for a job (~20 ms) until I reduce the number of SPs from 20 to 4 or fewer.
I added some debug output to the GP and found that there are never more than 14
messages received in the for loop that calls MPI_Test.
The messages that were sent from the other 6 SPs eventually arrive at
the GP in a long stream after experiencing high latency (over 600 ms).
Going forward, we need to handle more objects per job and will need to have
more than 4 SPs to keep up.
My thought is that I have to obey this 4 SPs to 1 GP ratio and create
intermediate GPs to gather results from every 4 slaves.
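To make the arithmetic of that two-level layout concrete, here is a small sketch. The function names are hypothetical, and the fan-in of 4 is just the ratio observed above, not a limit imposed by Open MPI.

```c
#include <assert.h>

/* Number of intermediate gatherers needed so that each one serves
 * at most fan_in slaves (ceiling division). */
static int num_intermediate_gatherers(int num_slaves, int fan_in) {
    assert(num_slaves >= 0 && fan_in > 0);
    return (num_slaves + fan_in - 1) / fan_in;
}

/* Zero-based index of the intermediate gatherer that a given
 * zero-based slave index would report to. */
static int gatherer_for_slave(int slave_index, int fan_in) {
    assert(slave_index >= 0 && fan_in > 0);
    return slave_index / fan_in;
}
```

With 20 SPs and a fan-in of 4 this gives 5 intermediate GPs, so the final GP would itself receive from 5 senders, which is already above the 4-to-1 ratio.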
Is this a contention problem at the GP?
Is there debugging or logging I can turn on in the MPI to prove that
contention is occurring?
Can I configure MPI receive processing to improve upon the 4 to 1 ratio?
Can I improve the controller method (listed above) to gain a performance
Thanks for any suggestions.
Gary Hodge
