IIRC, the first 16 or so messages over the openib BTL use the
send/recv API rather than RDMA, which is significantly faster. I
am not sure how 1.5.3 and multi-rail affect this, but preconnecting,
I believe, short-circuits that phase, since one has already cut over
to using RDMA for eager messages.
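(If memory serves, preconnection can be requested at launch via the
mpi_preconnect_mpi MCA parameter, roughly like this, which forces all
connections to be set up during MPI_Init instead of on first use:

  mpirun --mca mpi_preconnect_mpi 1 ...
)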
On 10/31/2012 3:36 PM, Paul Kapinos wrote:
Open MPI is clever and by default uses multiple IB adapters, if available.
Open MPI is lazy and establishes connections only if needed.
Both are good.
We have rather special nodes: up to 16 sockets, 128 cores, 4
boards, 4 IB cards. Multirail works!
The crucial thing is that, starting with v1.6.1, the very first
PingPong sample between two nodes takes a really long time - some
100x - 200x the usual latency. You cannot see this with the usual
latency benchmarks(*), because they tend to omit the first samples
as a "warm-up phase", but we use a self-written parallel test which
clearly shows this (and gave me reason to muse a bit).
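(Not our actual test, but a minimal sketch of the idea: a ping-pong
that prints every sample separately, including the very first one,
instead of discarding warm-up iterations. Message size and iteration
count are arbitrary choices.

/* pingpong sketch: per-sample latency between ranks 0 and 1 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    const int iters = 20;        /* few samples; sample 0 is the interesting one */
    char buf[8] = {0};           /* small, eager-sized message */
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }

    for (i = 0; i < iters; i++) {
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        double t1 = MPI_Wtime();
        if (rank == 0)   /* half round-trip time, in microseconds */
            printf("sample %2d: %.2f us\n", i, (t1 - t0) * 0.5e6);
    }

    MPI_Finalize();
    return 0;
}
)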
If multirail is forbidden (-mca btl_openib_max_btls 1), or if
v1.5.3 is used, or if the MPI processes are preconnected,
there are no such huge latency outliers for the first sample.
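(For clarity, the runs in question look roughly like the following;
the preconnect parameter name is from memory, and ./pingpong stands
for the test sketched above:

  mpirun --mca btl_openib_max_btls 1 ./pingpong   # single rail
  mpirun --mca mpi_preconnect_mpi 1 ./pingpong    # preconnect at MPI_Init
)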
Well, we know about warm-up and lazy connections.
But 200x?!
Any comments on whether this is expected?
(*) E.g. HPCC explicitly says:
> Additional startup latencies are masked out by starting with one
> non-measured ping-pong.
P.S. Sorry for cross-posting to both Users and Developers, but my
last questions to Users have received no reply yet, so I am trying
the devel mailing list.