
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering process
From: Brian Budge (brian.budge_at_[hidden])
Date: 2012-10-09 10:09:26


Hi Ralph -

Is this really true? I've been using thread_multiple in my openmpi
programs for quite some time... There may be known cases where it
will not work, but for vanilla MPI use, it seems good to go. That's
not to say that you can't create your own deadlock if you're not
careful, but those are cases where you'd expect deadlock. What specifically
is unsupported about thread_multiple?

  Brian

On Tue, Oct 9, 2012 at 6:30 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> We don't support thread_multiple, I'm afraid. Only thread_funneled, so
> you'll have to architect things so that each process can perform all its MPI
> actions inside of a single thread.
>
>
>
> On Tue, Oct 9, 2012 at 6:10 AM, Hodge, Gary C <gary.c.hodge_at_[hidden]> wrote:
>>
>> FYI, I implemented the harvesting thread but found out quickly that my
>> installation of Open MPI does not have MPI_THREAD_MULTIPLE support
>>
>> My worker thread still does MPI_Send calls to move the data to the next
>> process.
>>
>> So I am going to download 1.6.2 today, configure it with
>> --enable-thread-multiple and try again
>>
>>
>>
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>> Behalf Of Ralph Castain
>> Sent: Thursday, October 04, 2012 8:10 PM
>>
>>
>> To: Open MPI Users
>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering
>> process
>>
>>
>>
>> Sorry for delayed response - been on the road all day.
>>
>> Usually we use the standard NetPipe, IMB, and other benchmarks to measure
>> latency. IIRC, these are all point-to-point measurements - i.e., they
>> measure the latency for a single process sending to one other process
>> (typically on the order of a couple of microseconds). The tests may have
>> multiple processes running, but they don't have one process receiving
>> messages from multiple senders.
>>
>> You will, of course, see increased delays in that scenario just due to
>> cycle time - we give you a message, but cannot give you another one until
>> you return from our delivery callback. So the longer you spend in the
>> callback, the slower we go.
>>
>> In one use-case I recently helped with, we had a "harvesting" thread that
>> simply reaped the messages from the MPI callback and stuffed them into a
>> multi-threaded processing queue. This minimized the MPI "latency", but of
>> course the overall thruput depended on the speed of the follow-on queue. In
>> our case, we only had one process running on each node (like you), and had
>> lots of cores on the node - so we cranked up the threads in the processing
>> queue and rammed the data thru the pipe.
>>
>> Your design looks similar, so you might benefit from a similar approach.
>> Just don't try to have multiple MPI callbacks each sitting in a separate
>> thread as thread support in MPI isn't good - better to have a single thread
>> handling the MPI stuff, and then push it into a queue that multiple threads
>> can access.
>>
>> Anyway, glad that helped diagnose the issue.
>> Ralph
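The harvesting-thread arrangement described above could be sketched roughly as follows. This is a minimal illustration in C with POSIX threads, not the code from that use-case: the names `work_queue`, `harvester`, `worker` and `run_pipeline` are hypothetical, and the harvester simply pushes integers where a real program would make its MPI receive calls, so that all MPI activity stays in one thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP 64
#define MAX_WORKERS 16

/* Bounded queue shared by one harvester and several workers (hypothetical). */
typedef struct {
    int items[QUEUE_CAP];
    int head, count;
    bool closed;                       /* set when the harvester is finished */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} work_queue;

static void queue_push(work_queue *q, int item) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)      /* a full queue stalls the harvester */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Returns false once the queue is closed and fully drained. */
static bool queue_pop(work_queue *q, int *item) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    bool ok = q->count > 0;
    if (ok) {
        *item = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

typedef struct { work_queue *q; int n_msgs; } harvester_arg;
typedef struct { work_queue *q; long sum; } worker_arg;

/* The only thread that would touch MPI; here it fabricates messages 1..n. */
static void *harvester(void *p) {
    harvester_arg *a = p;
    for (int i = 1; i <= a->n_msgs; i++)
        queue_push(a->q, i);           /* assumption: stands in for a receive */
    pthread_mutex_lock(&a->q->lock);
    a->q->closed = true;               /* wake every worker so they can exit */
    pthread_cond_broadcast(&a->q->not_empty);
    pthread_mutex_unlock(&a->q->lock);
    return NULL;
}

/* Workers do the expensive part; here they just add up what they see. */
static void *worker(void *p) {
    worker_arg *a = p;
    int item;
    while (queue_pop(a->q, &item))
        a->sum += item;
    return NULL;
}

/* Runs one harvester and n_workers workers; returns the sum of all items. */
long run_pipeline(int n_msgs, int n_workers) {
    assert(n_workers > 0 && n_workers <= MAX_WORKERS);
    work_queue q = { .head = 0, .count = 0, .closed = false };
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);

    pthread_t h, w[MAX_WORKERS];
    worker_arg wa[MAX_WORKERS];
    harvester_arg ha = { &q, n_msgs };
    for (int i = 0; i < n_workers; i++) {
        wa[i].q = &q;
        wa[i].sum = 0;
        int rc = pthread_create(&w[i], NULL, worker, &wa[i]);
        assert(rc == 0); (void)rc;
    }
    int rc = pthread_create(&h, NULL, harvester, &ha);
    assert(rc == 0); (void)rc;

    pthread_join(h, NULL);
    long total = 0;
    for (int i = 0; i < n_workers; i++) {
        pthread_join(w[i], NULL);
        total += wa[i].sum;
    }
    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.lock);
    return total;
}
```

Calling `run_pipeline(1000, 4)` pushes the integers 1 to 1000 through four workers. In this sketch the harvester blocks when the queue is full, which in a real MPI program would delay the next receive, so the queue capacity and worker count would need to be sized so that this rarely happens.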
>>
>>
>>
>> On Thu, Oct 4, 2012 at 6:55 AM, Hodge, Gary C <gary.c.hodge_at_[hidden]>
>> wrote:
>>
>> Once I read your comment, Ralph, about this being “orders of magnitude
>> worse than anything we measure”, I knew it had to be our problem
>>
>>
>>
>> We already had some debug code in place to measure when we send and when
>> we receive over MPI. I turned this code on and ran with 12 slaves instead
>> of 4.
>>
>> Our debug showed that once an SP does a send, it is received at the GP in
>> less than 1 ms. I then decided to take a close look at when each SP was
>> sending a message.
>>
>> It turns out that the first 9 slaves send out messages at very regular
>> intervals, but the last 3 slaves have 200 - 600 ms delays in sending out a
>> message.
>>
>> It could be that our SPs have a problem when many are running at once. It
>> is also interesting to note that the first 9 slaves run on the same blade
>> chassis as the GP and the last 3 SPs run on our second blade chassis. I
>> will later experiment with the placement of our SPs across chassis to see
>> if this is an important factor or not.
>>
>>
>>
>> When I first reported this problem, I had only turned on debug in the
>> receiving GP process. The latency I was seeing then was the difference
>> between when I received a message from the 10th slave and when I received
>> the last message from the 10th slave. The time we use for our debug comes
>> from an MPI_Wtime call.
>>
>>
>>
>> Ralph, for my future reference, could you share how many processes were
>> sending to a single process in your testing, and what was the size of the
>> messages sent?
>>
>>
>>
>> Hristo, thanks for your input, I had already spent a few days searching
>> the faqs and tuning guides before posting.
>>
>>
>>
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>> Behalf Of Ralph Castain
>> Sent: Wednesday, October 03, 2012 4:01 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] EXTERNAL: Re: unacceptable latency in gathering
>> process
>>
>>
>>
>> Hmmm...you probably can't without digging down into the diagnostics.
>>
>>
>>
>> Perhaps we could help more if we had some idea how you are measuring this
>> "latency". I ask because that is orders of magnitude worse than anything we
>> measure - so I suspect the problem is in your app (i.e., that the time you
>> are measuring is actually how long it takes you to get around to processing
>> a message that was received some time ago).
>>
>>
>>
>>
>>
>> On Oct 3, 2012, at 11:52 AM, "Hodge, Gary C" <gary.c.hodge_at_[hidden]>
>> wrote:
>>
>>
>>
>> how do I tell the difference between when the message was received and
>> when the message was picked up in MPI_Test?
>>
>>
>>
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>> Behalf Of Ralph Castain
>> Sent: Wednesday, October 03, 2012 1:00 PM
>> To: Open MPI Users
>> Subject: EXTERNAL: Re: [OMPI users] unacceptable latency in gathering
>> process
>>
>>
>>
>> Out of curiosity, have you logged the time when the SP called "send" and
>> compared it to the time when the message was received, and when that message
>> is picked up in MPI_Test? In other words, have you actually verified that
>> the delay is in the MPI library as opposed to in your application?
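One way to make the comparison suggested above concrete is to carry three timestamps with each message and split the end-to-end delay into two parts. The sketch below is an assumption-laden illustration in plain C: the struct and function names are hypothetical, the timestamps are expected to come from something like `MPI_Wtime`, and the application can only observe when its test call first reports completion, not when the library actually received the data. The transit figure compares clocks on two different processes, so it is only meaningful if those clocks are synchronized (MPI exposes the `MPI_WTIME_IS_GLOBAL` attribute to indicate this); the queueing figure uses a single clock and is always comparable.

```c
#include <assert.h>

/* Timestamps in seconds for one message (hypothetical layout). */
typedef struct {
    double t_sent;       /* sender's clock, taken just before the send call  */
    double t_completed;  /* receiver's clock, when the test reports arrival  */
    double t_processed;  /* receiver's clock, when processing actually starts */
} msg_stamps;

typedef struct {
    double transit;      /* send call until the receive is seen as complete */
    double queueing;     /* time the message then waited in the application */
} latency_split;

/* Running worst case over the messages recorded so far. */
typedef struct {
    double max_transit;
    double max_queueing;
    int count;
} latency_stats;

/* Split one message's delay into library-side and application-side parts. */
latency_split split_latency(const msg_stamps *s) {
    latency_split d;
    d.transit = s->t_completed - s->t_sent;
    d.queueing = s->t_processed - s->t_completed;
    return d;
}

/* Fold one message into the running maxima. */
void record_latency(latency_stats *st, const msg_stamps *s) {
    latency_split d = split_latency(s);
    if (st->count == 0 || d.transit > st->max_transit)
        st->max_transit = d.transit;
    if (st->count == 0 || d.queueing > st->max_queueing)
        st->max_queueing = d.queueing;
    st->count++;
}
```

If the worst-case queueing value dominates, the delay is in the application's own loop rather than in the transport; if the transit value dominates and the clocks are trustworthy, the sender or the network deserves a closer look.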
>>
>>
>>
>>
>>
>> On Oct 3, 2012, at 9:40 AM, "Hodge, Gary C" <gary.c.hodge_at_[hidden]> wrote:
>>
>>
>>
>> Hi all,
>>
>> I am running on an IBM BladeCenter, using Open MPI 1.4.1 and the OpenSM
>> subnet manager for InfiniBand
>>
>>
>>
>> Our application has real time requirements and it has recently been proven
>> that it does not scale to meet future requirements.
>>
>> Presently, I am re-organizing the application to process work in a more
>> parallel manner than it does now.
>>
>>
>>
>> Jobs arrive at the rate of 200 per second and are sub-divided into groups
>> of objects by a master process (MP) on its own node.
>>
>> The MP then assigns the object groups to 20 slave processes (SP), each
>> running on their own node, to do the expensive computational work in
>> parallel.
>>
>> The SPs then send their results to a gatherer process (GP) on its own node
>> that merges the results for the job and sends it onward for final
>> processing.
>>
>> The highest latency for the last 1024 jobs that were processed is then
>> written to a log file that is displayed by a GUI.
>>
>> Each process uses the same controller method for sending and receiving
>> messages as follows:
>>
>>
>>
>> for (each CPU that sends us input)
>> {
>>     MPI_Irecv(…)
>> }
>>
>> while (true)
>> {
>>     for (each CPU that sends us input)
>>     {
>>         MPI_Test(…)
>>         if (message was received)
>>         {
>>             Copy the message
>>             Queue the copy to our input queue
>>             MPI_Irecv(…)
>>         }
>>     }
>>     if (there are messages on our input queue)
>>     {
>>         … process the FIRST message on queue (this may queue
>>         messages for output) …
>>
>>         for (each message on our output queue)
>>         {
>>             MPI_Send(…)
>>         }
>>     }
>> }
>>
>>
>>
>> My problem is that I do not meet our application's performance requirements
>> for a job (~ 20 ms) until I reduce the number of SPs from 20 to 4 or less.
>>
>> I added some debug into the GP and found that there are never more than 14
>> messages received in the for loop that calls MPI_Test.
>>
>> The messages that were sent from the other 6 SPs will eventually arrive at
>> the GP in a long stream after experiencing high latency (over 600 ms).
>>
>>
>>
>> Going forward, we need to handle more objects per job and will need to
>> have more than 4 SPs to keep up.
>>
>> My thought is that I have to obey this 4 SPs to 1 GP ratio and create
>> intermediate GPs to gather results from every 4 slaves.
>>
>>
>>
>> Is this a contention problem at the GP?
>>
>> Is there debugging or logging I can turn on in the MPI to prove that
>> contention is occurring?
>>
>> Can I configure MPI receive processing to improve upon the 4 to 1 ratio?
>>
>> Can I improve the controller method (listed above) to gain a performance
>> improvement?
>>
>>
>>
>> Thanks for any suggestions.
>>
>> Gary Hodge
>>
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>