
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [RFC] Low pressure OPAL progress
From: Ashley Pittman (ashley_at_[hidden])
Date: 2009-06-09 08:59:40


On Mon, 2009-06-08 at 17:50 +0200, Sylvain Jeaugey wrote:
> Principle
> =========
>
> opal_progress() ensures the progression of MPI communication. The current
> algorithm is a loop calling progress on all registered components. If the
> program is blocked, the loop will busy-poll indefinitely.

I have some experience here due to implementing this feature (blocking
waits) on Quadrics hardware. You're right that it can have benefits and
yielding the CPU when "idle" is a good thing in the general case.

The "correct" way for a process to relinquish the CPU is to block in a
select() or poll() call until data is received, whereupon it can wake up
and continue working. The major problem each and every MPI
implementation has is that select() only works for TCP/IP, not for
shared memory or any of the more exotic networks. IMHO it would be much
preferable to solve this problem properly and block in a wakeable
select() rather than usleep().

In my experience, when done correctly, performance is affected, but
surprisingly it can often lead to increased performance. We had full
coverage, however, so we were able to sleep early and wake up in a
timely manner on receiving any message. Yielding even one CPU per node
from the application occasionally gives any background/OS processing a
chance to run without impacting the performance of the application, so
enabling blocking waits can lead to quicker runtimes.

> Going to sleep after a certain amount of time with nothing received is
> interesting for two things:
>
> - An administrator can easily detect whether a job is deadlocked: all the
> processes are in sleep(). Currently, all processors are using 100% CPU and
> it is very hard to know whether progression is still happening or not.

This is a valuable thing to know; however, I don't view the proposed
solution as the correct one. If this were the problem you were aiming to
solve, I'd recommend a different approach, more like the LLNL solution
that Ralph described.

Yours,

Ashley Pittman.

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk