Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] About MPI_TAG_UB
From: Sébastien Boisvert (sebastien.boisvert.3_at_[hidden])
Date: 2012-09-28 10:38:26


Hello,

On 28/09/12 10:00 AM, Jeff Squyres wrote:
> On Sep 28, 2012, at 9:50 AM, Sébastien Boisvert wrote:
>
>> I did not know about shared queues.
>>
>> It does not run out of memory. ;-)
>
> It runs out of *registered* memory, which could be far less than your actual RAM. Check this FAQ item in particular:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
>

I see.

$ cat /sys/module/mlx4_core/parameters/log_num_mtt
0

$ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
0

$ getconf PAGE_SIZE
4096

With the formula

max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE

            = (2^0) * (2^0) * 4096

            = 1 * 1 * 4096

            = 4096 bytes

Whoa! One page.

That should help.

There are 32 GiB of memory.

So I will ask someone to set log_num_mtt=23 and log_mtts_per_seg=1.

  => 68719476736 = (2**23)*(2**1)*4096
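
For reference, assuming the usual mlx4_core module-option convention (the exact file
name under /etc/modprobe.d/ varies by distribution), the change would look like:

  # /etc/modprobe.d/mlx4_core.conf  (file name is only an example)
  options mlx4_core log_num_mtt=23 log_mtts_per_seg=1

followed by reloading the driver or rebooting. That allows registering up to 64 GiB,
twice the 32 GiB of physical RAM.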

>> But the latency is not very good.
>>
>> ** Test 1
>>
>> --mca btl_openib_max_send_size 4096 \
>> --mca btl_openib_eager_limit 4096 \
>> --mca btl_openib_rndv_eager_limit 4096 \
>> --mca btl_openib_receive_queues S,4096,2048,1024,32 \
>>
>> I get 1.5 milliseconds.
>>
>> => https://gist.github.com/3799889
>>
>> ** Test 2
>>
>> --mca btl_openib_receive_queues S,65536,256,128,32 \
>>
>> I get around 1.5 milliseconds too.
>>
>> => https://gist.github.com/3799940
>
> Are you saying 1.5us is bad?

1.5 us is very good. But I get 1.5 ms with shared queues (see above).
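
For context, the numbers above come from a request-reply round trip. A minimal MPI
ping-pong along the following lines measures the same kind of latency (this is only a
sketch, not the benchmark behind the gists above):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, i;
      const int iterations = 10000;
      char buf[1] = {0};
      double start, elapsed;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Barrier(MPI_COMM_WORLD);
      start = MPI_Wtime();

      for (i = 0; i < iterations; i++) {
          if (rank == 0) {
              /* rank 0 sends first, then waits for the echo */
              MPI_Send(buf, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(buf, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
              /* rank 1 echoes every message back */
              MPI_Recv(buf, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(buf, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
          }
      }

      elapsed = MPI_Wtime() - start;
      if (rank == 0)
          /* one-way latency = total time / iterations / 2 */
          printf("latency: %f us\n", elapsed / iterations / 2.0 * 1e6);

      MPI_Finalize();
      return 0;
  }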

> That's actually not bad at all. On the most modern hardware with a bunch of software tuning, you can probably get closer to 1us.
>
>> With my virtual router I am sure I can get something around 270 microseconds.
>
> OTOH, that's pretty bad. :-)

I know. All my Ray processes are doing busy waiting; if MPI were event-driven,
I would call my software sleepy Ray when latency is high.
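
(What I mean by busy waiting is roughly the following polling pattern; this is a
simplified sketch, not Ray's actual code, and receive_and_handle()/do_some_work()
are hypothetical helpers:)

  while (!done) {
      int flag = 0;
      MPI_Status status;

      /* poll for an incoming message without blocking */
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);

      if (flag) {
          /* a message is waiting: receive and handle it */
          receive_and_handle(&status);
      } else {
          /* nothing to receive: spin and do a slice of work */
          do_some_work();
      }
  }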

>
> I'm not sure why it would be so bad -- are you hammering the virtual router with small incoming messages?

There are 24 AMD Opteron(tm) Processor 6172 cores sharing a single Mellanox Technologies
MT26428 HCA on each node. That may be part of the cause too.

> You might need to do a little profiling to see where the bottlenecks are.
>

Well, given the very valuable information you provided about log_num_mtt and log_mtts_per_seg
for the Linux kernel module mlx4_core, I think the registered-memory limit may be the root of our problem.

We get 20-30 us with 4096 processes on a Cray XE6, so it is unlikely that the bottleneck is in
our software.

>> Just out of curiosity, does Open-MPI utilize heavily negative values
>> internally for user-provided MPI tags ?
>
> I know offhand we use them for collectives. Something is tickling my brain that we use them for other things, too (CID allocation, perhaps?), but I don't remember offhand.
>

The only collective operation I use is MPI_Barrier, and only a few of those.

> I'm just saying: YMMV. Buyer be warned. And all that. :-)
>

Yes, I agree on this: non-portable code is not portable, unexpected behaviors and all.

>> If the negative tags are internal to Open-MPI, my code will not touch
>> these private variables, right ?
>
> It's not a variable that's the issue. If you do a receive for tag -3 and OMPI sends an internal control message with tag -3, you might receive it instead of OMPI's core. And that would be Bad.
>

Ah, I see. By removing the checks in my silly patch, I can now intercept messages
that were meant for OMPI's core. Hehe.
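
For the record, the portable way to stay inside the legal tag range (instead of patching
the checks out) is to query the MPI_TAG_UB attribute; a minimal sketch:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int *tag_ub = NULL;
      int flag = 0;

      MPI_Init(&argc, &argv);

      /* MPI_TAG_UB is a predefined attribute; valid user tags are 0 .. *tag_ub */
      MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
      if (flag)
          printf("largest valid tag: %d\n", *tag_ub);

      MPI_Finalize();
      return 0;
  }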