On 28/09/12 10:00 AM, Jeff Squyres wrote:
> On Sep 28, 2012, at 9:50 AM, Sébastien Boisvert wrote:
>> I did not know about shared queues.
>> It does not run out of memory. ;-)
> It runs out of *registered* memory, which could be far less than your actual RAM. Check this FAQ item in particular:
$ cat /sys/module/mlx4_core/parameters/log_num_mtt
$ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
$ getconf PAGE_SIZE
With the formula
max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE
= (2^0) * (2^0) * 4096
= 1 * 1 * 4096
= 4096 bytes
Whoa! One page.
That should help.
There are 32 GiB of memory.
So I will ask someone to set log_num_mtt=23 and log_mtts_per_seg=1.
=> (2**23) * (2**1) * 4096 = 68719476736 bytes = 64 GiB, twice the physical RAM.
>> But the latency is not very good.
>> ** Test 1
>> --mca btl_openib_max_send_size 4096 \
>> --mca btl_openib_eager_limit 4096 \
>> --mca btl_openib_rndv_eager_limit 4096 \
>> --mca btl_openib_receive_queues S,4096,2048,1024,32 \
>> I get 1.5 milliseconds.
>> => https://gist.github.com/3799889
>> ** Test 2
>> --mca btl_openib_receive_queues S,65536,256,128,32 \
>> I get around 1.5 milliseconds too.
>> => https://gist.github.com/3799940
> Are you saying 1.5us is bad?
1.5 us is very good. But I get 1.5 ms with shared queues (see above).
> That's actually not bad at all. On the most modern hardware with a bunch of software tuning, you can probably get closer to 1us.
>> With my virtual router I am sure I can get something around 270 microseconds.
> OTOH, that's pretty bad. :-)
I know; all my Ray processes are busy-waiting. If MPI were event-driven,
I would call my software Sleepy Ray when latency is high.
> I'm not sure why it would be so bad -- are you hammering the virtual router with small incoming messages?
There are 24 AMD Opteron 6172 cores per node sharing a single Mellanox MT26428 HCA.
That may also be part of the cause.
> You might need to do a little profiling to see where the bottlenecks are.
Well, with the very valuable information you provided about log_num_mtt and log_mtts_per_seg
for the Linux kernel module mlx4_core, I think this may be the root of our problem.
We get 20-30 us with 4096 processes on a Cray XE6, so it is unlikely that the bottleneck is in
Ray itself.
>> Just out of curiosity, does Open-MPI utilize heavily negative values
>> internally for user-provided MPI tags ?
> I know offhand we use them for collectives. Something is tickling my brain that we use them for other things, too (CID allocation, perhaps?), but I don't remember offhand.
The only collective I use is MPI_Barrier, and only a few times.
> I'm just saying: YMMV. Buyer be warned. And all that. :-)
Yes, I agree: non-portable code is not portable, unexpected behaviors and all.
>> If the negative tags are internal to Open-MPI, my code will not touch
>> these private variables, right ?
> It's not a variable that's the issue. If you do a receive for tag -3 and OMPI sends an internal control message with tag -3, you might receive it instead of OMPI's core. And that would be Bad.
Ah, I see. By removing the checks in my silly patch, I can now dictate things to do to