On Sep 28, 2012, at 10:38 AM, Sébastien Boisvert wrote:
> 1.5 us is very good. But I get 1.5 ms with shared queues (see above).
Oh, I misread (I blame it on jet lag...).
Yes, that seems waaaay too high.
You didn't do a developer build, did you? We add a bunch of extra debugging in developer builds that adds a bunch of latency. And you're not running over-subscribed, right?
>> OTOH, that's pretty bad. :-)
> I know, all my Ray processes are doing busy waiting. If MPI were event-driven,
> I would call my software sleepy Ray when latency is high.
>> I'm not sure why it would be so bad -- are you hammering the virtual router with small incoming messages?
> There are 24 AMD Opteron(tm) Processor 6172 cores for 1 Mellanox Technologies MT26428 on each node. That may be the cause too.
That's a QDR HCA, right? (i.e., I assume it's very recent)
Try running some simple point-to-point benchmarks and see if you're getting the same latency (i.e., don't run benchmarks in your app -- get a baseline with some well-known benchmarks first).
>> You might need to do a little profiling to see where the bottlenecks are.
> Well, with the very valuable information you provided about log_num_mtt and log_mtts_per_seg for the Linux kernel module mlx4_core, I think this may be the root of our problem.
It is definitely a cause, but perhaps not the only cause.
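For reference, here is a rough sketch of the arithmetic behind those two mlx4_core parameters. It assumes the commonly cited relation max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE, and the usual rule of thumb that registerable memory should be at least twice physical RAM. The function names and the sample values are made up for illustration; plug in the numbers from your own nodes.

```python
import os


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=None):
    """Upper bound on registerable memory implied by the two mlx4_core
    parameters, assuming max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE."""
    if page_size is None:
        # Use the page size of the machine running this script.
        page_size = os.sysconf("SC_PAGE_SIZE")
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


def covers_physical_memory(log_num_mtt, log_mtts_per_seg, physical_bytes,
                           page_size=None, factor=2):
    """True if the registerable limit is at least `factor` times physical RAM.
    factor=2 reflects the rule of thumb mentioned above (an assumption here)."""
    limit = max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size)
    return limit >= factor * physical_bytes


if __name__ == "__main__":
    GIB = 1 << 30
    # Hypothetical values, not taken from any real node.
    limit = max_registerable_bytes(20, 3, page_size=4096)
    print("registerable limit: %d GiB" % (limit // GIB))
    print("enough for 32 GiB of RAM:",
          covers_physical_memory(20, 3, 32 * GIB, page_size=4096))
```

On a node where the module is loaded, the current values can usually be read from the files under /sys/module/mlx4_core/parameters/, though that depends on your driver version.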
> We get 20-30 us on 4096 processes on Cray XE6, so it is unlikely that the bottleneck is in our software.
Possibly not. But every environment is different, and the same software can perform differently in different environments.
> Yes, I agree on this: non-portable code is not portable, and it comes with unexpected behaviors.
> Ah I see. By removing the checks in my silly patch, I can now dictate things to do to OMPI. Hehe.