Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] knem/openmpi performance?
From: Iliev, Hristo (Iliev_at_[hidden])
Date: 2013-07-18 09:57:08


> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Dave Love
> Sent: Thursday, July 18, 2013 1:22 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] knem/openmpi performance?
>
> Paul Kapinos <kapinos_at_[hidden]> writes:
>
> > Jeff, I would turn the question the other way around:
> >
> > - are there any penalties when using KNEM?
>
> Bull should be able to comment on that -- they turn it on by default in their
> proprietary OMPI derivative -- but I doubt I can get much of a story on it.
> Mellanox ship it now too, but I don't know if their distribution defaults to
> using it.
>
> I expect to use knem on hardware that's essentially the same as Mark's.
> If any issues appear in production, I'll be surprised and will report them.
>
> > We have a couple of Really Big Nodes (128 cores) with non-huge memory
> > bandwidth (because coupled of 4x standalone nodes with 4 sockets
> > each).
>
> I was hoping to have some results for just such a setup, but haven't been
> able to spend any time on it this week. If there are any suggestions for OMPI
> tuning on it I'd be interested.
>

Detailed results are coming in the near future, but the benchmarks done up
to now indicate that collectives that use bulk (non-segmented) transfers,
e.g. MPI_Alltoall with large chunks, get quite a decent speed bump from
KNEM transfers: about 1.5x for 128 processes and 4 MiB data chunks.

Collectives that use pipelines, e.g. MPI_Bcast with large messages and many
processes, suffer big time, because the default algorithm selection
heuristics are inadequate. For example, an 8 MiB message is pipelined to 127
other processes using a segment size of 8 KiB, and with KNEM this takes more
than 10x longer than with the user-space double-copy method. One therefore
has to override the heuristics by providing a proper set of dynamic rules in
a largely undocumented file format.
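
To put the 8 MiB broadcast example in numbers, here is a rough sketch that only counts segment transfers. It assumes that every non-root process receives each segment exactly once, and that each transfer pays some fixed setup cost. That cost model is an assumption for illustration, not something measured here, and the function names are hypothetical.

```python
# Back-of-the-envelope count of segment transfers for a segmented broadcast.
# The sizes come from the example above; the per-transfer overhead is a
# hypothetical parameter, not a measured value.

MIB = 1024 * 1024
KIB = 1024


def segment_count(message_bytes: int, segment_bytes: int) -> int:
    """Number of segments needed to cover the message (ceiling division)."""
    return -(-message_bytes // segment_bytes)


def total_segment_receives(message_bytes: int, segment_bytes: int, nprocs: int) -> int:
    """Total receives, assuming each non-root process gets every segment once."""
    return segment_count(message_bytes, segment_bytes) * (nprocs - 1)


def fixed_overhead_seconds(transfers: int, per_transfer_overhead_s: float) -> float:
    """Total fixed cost if each transfer pays a constant setup overhead."""
    return transfers * per_transfer_overhead_s


if __name__ == "__main__":
    segmented = total_segment_receives(8 * MIB, 8 * KIB, 128)
    unsegmented = total_segment_receives(8 * MIB, 8 * MIB, 128)
    print(f"segments per receiver: {segment_count(8 * MIB, 8 * KIB)}")
    print(f"receives with 8 KiB segments: {segmented}")
    print(f"receives without segmentation: {unsegmented}")
```

With 8 KiB segments each receiver handles 1024 small transfers instead of one large one, so any fixed per-transfer cost is multiplied accordingly. The count says nothing about how large that cost actually is with KNEM.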

> > So cutting the bandwidth in halves on these nodes sound like Very Good
> > Thing.
> >
> > But otherwise we've 1500+ nodes with 2 sockets and 24GB memory only
> > and we do not wanna to disturb the production on these nodes.... (and
> > different MPI versions for different nodes are doofy).
>
> Why would you need that? Our horribly heterogeneous cluster just has a
> node group-specific openmpi-mca-params.conf, and SGE parallel
> environments keep jobs in specific host groups with basically the same CPU
> speed and interconnect.
>

MPI_Alltoall(v) with large chunks seems to benefit on those machines too.
And we have a number of applications that perform lots of single-node
all-to-all operations.

> >
> > Best
> >
> > Paul

Regards,
Hristo

--
Hristo Iliev, PhD - High Performance Computing Team
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)

