IIRC, the first 16 or so messages over the openib btl use the send/recv API rather than RDMA, which is significantly faster.  I am not sure how 1.5.3 and multi-rail affect this, but I believe preconnecting short-circuits that phase, so the cutover to RDMA for eager messages has already happened by the time the first measured message is sent.
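
For what it's worth, one quick way to check that claim is to time each of the first few round trips individually instead of throwing them away as warm-up; if the cutover theory holds, the per-sample latency should drop after roughly that many messages. Below is a minimal sketch of such a check (my own illustration, not the test Paul mentions; the sample count of 32 is arbitrary, just above the recollected ~16-message cutover):

/* Sketch: time each of the first ping-pong round trips individually to see
 * where the latency drops.  The ~16-message figure is a recollection, not a
 * verified constant.  Run with two ranks placed on different nodes. */
#include <mpi.h>
#include <stdio.h>

#define NSAMPLES 32   /* a bit more than the assumed ~16-message cutover */

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < NSAMPLES; i++) {
        if (rank == 0) {
            double t0 = MPI_Wtime();
            MPI_Send(&byte, 1, MPI_CHAR, 1, i, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("sample %2d: %8.2f us\n", i, (MPI_Wtime() - t0) * 1e6);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, i, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, i, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}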


On 10/31/2012 3:36 PM, Paul Kapinos wrote:
Hello all,

Open MPI is clever and uses multiple IB adapters by default, if available.

Open MPI is lazy and establishes connections only if needed.

Both are good.

We have kinda special nodes: up to 16 sockets, 128 cores, 4 boards, 4 IB cards. Multirail works!

The crucial thing is that, starting with v1.6.1, the latency of the very first PingPong sample between two nodes is really huge - some 100x - 200x the usual latency. You cannot see this with the usual latency benchmarks(*) because they tend to omit the first samples as a "warm-up phase", but we use a self-written parallel test which shows this clearly (and which kept me puzzled for some days).
If multirail is forbidden (-mca btl_openib_max_btls 1), or if v1.5.3 is used, or if the MPI processes are preconnected (http://www.open-mpi.org/faq/?category=running#mpi-preconnect), there are no such huge latency outliers for the first sample.
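
[For illustration, a hand-rolled warm-up in the same spirit as the preconnect option could look like the sketch below: each rank exchanges one small message with every other rank before the measured phase starts. This is only a sketch of the idea, not what Open MPI actually does internally when mpi_preconnect_mpi is set.]

/* Sketch: establish all pairwise connections up front, similar in spirit to
 * the mpi_preconnect_mpi option referenced above (not Open MPI's internal
 * implementation).  MPI_Sendrecv keeps the exchange deadlock-free. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    char sb = 0, rb = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int peer = 0; peer < size; peer++) {
        if (peer == rank)
            continue;
        MPI_Sendrecv(&sb, 1, MPI_CHAR, peer, 0,
                     &rb, 1, MPI_CHAR, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    /* ... the timed benchmark would start here ... */

    MPI_Finalize();
    return 0;
}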

Well, we know about the warm-up and lazy connections.

But 200x ?!

Any comments on whether this behaviour is expected?


Paul Kapinos

(*) E.g. HPCC explicitly says in http://icl.cs.utk.edu/hpcc/faq/index.html#132
> Additional startup latencies are masked out by starting the measurement after
> one non-measured ping-pong.

P.S. Sorry for cross-posting to both Users and Developers, but my last questions to Users have received no reply so far, so I am trying to broadcast...
