On 8/18/2012 3:54 AM, Jeff Squyres wrote:
> Mike / Yevgeny --
> Can you comment on what is going on here? It would be really good to understand exactly what these 2 MLX4 parameters are (e.g., why you suggested increasing one and not the other), and why there would be differences in registering small numbers of large chunks of contiguous memory vs. large numbers of small contiguous chunks of memory... Also, is there a magic property about being able to register 2x physical memory? I was under the impression that just being able to register anything >= 1x physical memory was sufficient.
I'm on vacation and mostly off-line, but please see below some info that hopefully
answers the questions. If I missed something, I'll dig more info about it when I
get back (somewhere around Thursday).
So we're talking about log_num_mtt and log_mtts_per_seg, which are parameters
that control the memory translation table (MTT).
The MTT has segments, and each segment has entries. Each entry can hold one translation,
which means that it lets you register one page.
log_num_mtt controls the number of MTT segments and log_mtts_per_seg controls the
number of entries per segment, both on a logarithmic (base-2) scale.
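As a rough sketch of how the two parameters combine: the upper bound on registerable memory is (number of segments) x (entries per segment) x (page size). The 4 KiB page size and the helper name below are assumptions made for illustration, not something taken from the driver.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Upper bound on registerable memory for the given MTT parameters.

    Hypothetical helper: segments * entries per segment * page size.
    """
    segments = 1 << log_num_mtt                # 2**log_num_mtt segments
    entries_per_segment = 1 << log_mtts_per_seg  # 2**log_mtts_per_seg entries each
    return segments * entries_per_segment * page_size


if __name__ == "__main__":
    for log_num_mtt, log_mtts_per_seg in [(20, 3), (21, 3), (24, 3)]:
        gib = max_registerable_bytes(log_num_mtt, log_mtts_per_seg) / 2**30
        print(f"log_num_mtt={log_num_mtt} log_mtts_per_seg={log_mtts_per_seg}: {gib:.0f} GiB")
```

Under that page-size assumption, 20/3 works out to 32 GiB, which agrees with the default figure Paul quotes further down.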
Each memory registration uses either a whole segment or a whole number of segments.
You can't have two separate memory registrations in the same segment, even if
there are unused entries in the segment.
So what do we get? MTT fragmentation.
Larger segments - more internal fragmentation, but fewer segments used per registration.
Smaller segments - less fragmentation, but more segments per registration.
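To make the trade-off concrete, here is a small sketch that counts whole segments and unused entries for a single registration. It only models the "whole segments per registration" rule described above; the function names are hypothetical.

```python
def segments_for_registration(num_pages, log_mtts_per_seg):
    """Whole segments consumed by one registration covering num_pages pages."""
    entries_per_segment = 1 << log_mtts_per_seg
    # Integer ceiling division: a partly used segment still counts as a whole one.
    return -(-num_pages // entries_per_segment)


def wasted_entries(num_pages, log_mtts_per_seg):
    """Entries left unused (internal fragmentation) by that registration."""
    entries_per_segment = 1 << log_mtts_per_seg
    used_segments = segments_for_registration(num_pages, log_mtts_per_seg)
    return used_segments * entries_per_segment - num_pages


if __name__ == "__main__":
    # A 5-page registration with 8, 2 and 1 entries per segment.
    for log_mtts_per_seg in (3, 1, 0):
        print(log_mtts_per_seg,
              segments_for_registration(5, log_mtts_per_seg),
              wasted_entries(5, log_mtts_per_seg))
```

For a 5-page registration, 8-entry segments use one segment and leave 3 entries idle, while 1-entry segments use five segments and waste nothing.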
Every application is different, so YMMV. I don't have any extensive research to back
my statement, but I've been told that sometimes smaller segments have a benefit.
You can try both ways and see if there is a difference. There's a big chance you won't see any.
As for 2x physical memory: because of MTT internal fragmentation, you need the MTT to
have more entries than there are physical pages in memory. 2x seems to be enough.
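If you want to turn the 2x rule of thumb into a number, a sketch like the following picks the smallest log_num_mtt that covers a given multiple of physical memory. The 4 KiB page size, the default log_mtts_per_seg of 3, and the helper name are assumptions for illustration; the factor is expected to be an integer.

```python
def min_log_num_mtt(physical_bytes, log_mtts_per_seg=3, factor=2, page_size=4096):
    """Smallest log_num_mtt whose MTT covers factor * physical_bytes.

    Hypothetical helper; assumes 4 KiB pages unless page_size is given.
    """
    # Pages that must be translatable, rounded up.
    target_pages = -(-(factor * physical_bytes) // page_size)
    entries_per_segment = 1 << log_mtts_per_seg
    # Segments needed to hold that many entries, rounded up.
    segments_needed = -(-target_pages // entries_per_segment)
    # ceil(log2(n)) for n >= 1 is (n - 1).bit_length().
    return (segments_needed - 1).bit_length()


if __name__ == "__main__":
    for ram_gib in (24, 96):
        print(f"{ram_gib} GiB RAM -> log_num_mtt >= {min_log_num_mtt(ram_gib * 2**30)}")
```

Because the parameter is a power-of-two exponent, the result overshoots the target: 24 GiB of RAM gives 21 (64 GiB registerable) and 96 GiB gives 23 (256 GiB registerable) under these assumptions.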
> Paul: I think we're happy enough for 1.6.1. We can always make this better in 1.6.2, but I think we've fixed the major problems enough for a release.
> On Aug 14, 2012, at 11:26 AM, Paul Kapinos wrote:
>> Hi Jeff,
>> Hi All,
>> On 08/07/12 18:51, Jeff Squyres wrote:
>>> So I'm not 100% clear on what you mean here: when you set the OFED params to
>>> allow registration of more memory than you have physically,
>>> does the problem go away?
>> We are talking about machines with 24GB RAM (S) and 96GB RAM (L).
>> The default values for the Mellanox/OFED parameters are 20/3 => 32GB registerable memory (RM) on both S and L. This is more than the memory of S, but less than 2x the memory of S, and less than the memory of L.
>> If the OFED parameters are bumped up to at least RM=64GB (20/3 => 21/3, 22/3, 24/3) there are no errors; I've just tested it with 8GB and 15.5GB of data (usually starting 1x ppn).
>> If the OFED parameters are _not_ changed (=32GB RM) there is _no_ warning on S nodes; on L nodes the user gets this warning:
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> Registerable memory: 32768 MiB
>> Total memory: 98293 MiB
>> .. hardly surprising - the warning appears if and only if (RM < memory).
>> If the OFED parameters are _not_ changed (=32GB RM) and I try to send at least 8GB _in one chunk_, then the 'queue pair' error occurs (see S_log.txt and my last mail). More exactly, at least one process seems to die in MPI_Finalize (all output of the program is correct). The same error also occurs on L nodes, surrounded by the above warning (L_log.txt).
>>>> From your log messages, the warning messages were from machines with
>>>> nearly 100GB RAM but only 32GB register-able. But only one of those was
>>>> the same as one that showed QP creation failures.
>>>> So I'm not clear which was which.
>>> Regardless: can you pump the MTT params up to allow registering all
>>> of physical memory on those machines, and see if you get any failures?
>> As you can see, on a node with 24GB memory and 32GB RM there can be a failure without any warning from the Open MPI side :-(
>>> To be clear: we're trying to determine if we should spend more effort
>>> on making OMPI work properly in low-registered-memory-available
>>> scenarios, or whether we should just emphasize
>>> "go reset your MTT parameters to allow registering all of physical memory."
>> Having experienced failures when only 1.5x of phys.mem. is allowed for registering, I would follow Mellanox in "go reset your MTT to allow _twice_ the phys.memory".
>> - if the OFED parameters are bumped up, everything is OK
>> - there is a [rare] combination that your great workaround did not catch.
>> - allowing 2x memory to be registered could be a good idea.
>> Does this make sense?
>> Paul Kapinos
>> P.S. The example program used is of course synthetic, but it is closely modeled on the Serpent software. (However, Serpent usually sends in chunks, whereas the actual error arises if all 8GB are sent in one piece.)
>> P.S.2 When everything works, increasing the chunk size to HUGE values seems to make performance worse - sending all 15.5 GB in one piece is more than twice as slow as sending it in 200 MB pieces. See chunked_send.txt
>> (the first parameter is #doubles of data, the 2nd is #doubles in a chunk).
>> P.S.3 all experiments above with 1.6.1rc2
>> P.S.4. I'm also performing some linpack runs with 6 nodes, and my very first impression is that increasing log_num_mtt to huge values is a bad idea (performance loss of about 5%). But let me finish them first...
>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>> RWTH Aachen University, Center for Computing and Communication
>> Seffenter Weg 23, D 52074 Aachen (Germany)
>> Tel: +49 241/80-24915