
Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Heads up on new feature to 1.3.4
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2009-08-17 06:02:50


In a multi-job environment, can't we just start binding processes on the
first available and unused socket?
I mean, the first job/user would start binding itself from socket 0, and the
next job/user would start binding itself from socket 2, for instance.
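
Something like this hypothetical selection step (just a sketch; the real
launcher would need to know which sockets other jobs already occupy):

    def first_free_socket(num_sockets, sockets_in_use):
        """Return the lowest-numbered socket no other job is bound to."""
        for s in range(num_sockets):
            if s not in sockets_in_use:
                return s
        return 0  # every socket already has a job; fall back to socket 0

    # Earlier jobs occupy sockets 0 and 1 on a 4-socket node, so the next
    # job would start binding from socket 2.
    print(first_free_socket(4, {0, 1}))   # -> 2
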
Lenny.

On Mon, Aug 17, 2009 at 6:02 AM, Ralph Castain <rhc_at_[hidden]> wrote:

>
> On Aug 16, 2009, at 8:16 PM, Eugene Loh wrote:
>
> Chris Samuel wrote:
>
> ----- "Eugene Loh" <Eugene.Loh_at_[hidden]> <Eugene.Loh_at_[hidden]> wrote:
>
>
> This is an important discussion.
>
>
> Indeed! My big fear is that people won't pick up the significance
> of the change and will complain about performance regressions
> in the middle of an OMPI stable release cycle.
>
> 2) The proposed OMPI bind-to-socket default is less severe. In the
> general case, it would allow multiple jobs to bind in the same way
> without oversubscribing any core or socket. (This comment added to
> the trac ticket.)
>
>
> That's a nice clarification, thanks. I suspect though that the
> same issue we have with MVAPICH would occur if two 4-core jobs
> both bound themselves to the first socket.
>
>
> Okay, so let me point out a second distinction from MVAPICH: the default
> policy would be to spread out over sockets.
>
> Let's say you have two sockets, with four cores each. Let's say you submit
> two four-core jobs. The first job would put two processes on the first
> socket and two processes on the second. The second job would do the same.
> The loading would be even.
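>
> To make the arithmetic concrete, here is a toy sketch of the round-robin
> "bysocket" mapping described above (an illustration only, not the actual
> mapper code; it assumes 2 sockets x 4 cores and two 4-process jobs):
>
>     def bysocket_placement(num_procs, num_sockets=2):
>         """Assign each rank to a socket round-robin: 0, 1, 0, 1, ..."""
>         return [rank % num_sockets for rank in range(num_procs)]
>
>     job_a = bysocket_placement(4)   # [0, 1, 0, 1]
>     job_b = bysocket_placement(4)   # [0, 1, 0, 1]
>
>     # Each socket gets 2 processes from each job: 4 processes on 4 cores
>     # per socket, so no core or socket is oversubscribed.
>     for s in range(2):
>         print("socket", s, "->", job_a.count(s) + job_b.count(s), "processes")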
>
> I'm not saying there couldn't be problems. It's just that MVAPICH2 (at
> least what I looked at) has multiple shortfalls. The binding is to fill up
> one socket after another (which decreases memory bandwidth per process and
> increases chances of collisions with other jobs) and binding is to core
> (increasing chances of oversubscribing cores). The proposed OMPI behavior
> distributes over sockets (improving memory bandwidth per process and
> reducing collisions with other jobs) and binding is to sockets (reducing
> chances of oversubscribing cores, whether due to other MPI jobs or due to
> multithreaded processes). So, the proposed OMPI behavior mitigates the
> problems.
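>
> In the same toy notation, the fill-one-socket-first mapping being
> contrasted here would look like this (again just a sketch, assuming
> 2 sockets x 4 cores and one 4-process job):
>
>     def fill_socket_first(num_procs, cores_per_socket=4):
>         """Pack ranks onto socket 0 until it is full, then move on."""
>         return [rank // cores_per_socket for rank in range(num_procs)]
>
>     # The whole 4-process job lands on socket 0 ...
>     print(fill_socket_first(4))   # [0, 0, 0, 0]
>     # ... whereas the round-robin mapping above gives [0, 1, 0, 1].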
>
> It would be even better to have binding selections adapt to other bindings
> on the system.
>
> In any case, regardless of what the best behavior is, I appreciate the
> point about changing behavior in the middle of a stable release. Arguably,
> leaving significant performance on the table in typical situations is a bug
> that warrants fixing even in the middle of a release, but I won't try to
> settle that debate here.
>
>
> I think the problem here, Eugene, is that performance benchmarks are far
> from the typical application. We have repeatedly seen this - optimizing for
> benchmarks frequently makes applications run less efficiently. So I concur
> with Chris on this one - let's not go -too- benchmark happy and hurt the
> regular users.
>
> Here at LANL, binding to-socket instead of to-core hurts performance by
> ~5-10%, depending on the specific application. Of course, either binding
> method is superior to no binding at all...
>
> UNLESS you have a threaded application, in which case -any- binding can be
> highly detrimental to performance.
>
> So going slow on this makes sense. If we provide the capability, but leave
> it off by default, then people can test it against real applications and see
> the impact. Then we can better assess the right default settings.
>
> Ralph
>
>
> 3) Defaults (if I understand correctly) can be set differently
> on each cluster.
>
>
> Yes, but the defaults should be sensible for the majority of
> clusters. If the majority do indeed share nodes between jobs
> then I would suggest that the default should be off and the
> minority who don't share nodes should have to enable it.
>
>
> In debates on this subject, I've heard people argue that:
>
> *) Though nodes are getting fatter, most are still thin.
>
> *) Resource managers tend to space-share the cluster.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>