On Aug 3, 2012, at 6:24 PM, Paul Kapinos wrote:
> testing our well-known example of the registered memory problem (see http://www.open-mpi.org/community/lists/users/2012/02/18565.php) on freshly-installed 1.6.1rc2, found out that "Fall back to send/receive semantics" did not always work. However the behaviour has changed:
> 1.5.3. and older: MPI processes hang and block the IB interface(s) forever
> 1.6.1rc2: a) MPI processes run through (if the chunk size is less than 8Gb) with or without a warning; or
> b) MPI processes die (if the chunk size is more than 8Gb)
We talked about this mail today on our weekly teleconference.
That's odd. Looking at your output files, I see that the processes failed when trying to create a queue pair. Let me explain...
Our newest stop-gap scheme on the 1.6 branch is as follows:
- figure out how much physical RAM is on the machine
- take 85% of that number
- M = (85% of physical_memory / num_mpi_procs_on_machine)
- don't let any individual MPI process register more than M bytes of memory
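As a rough illustration of the steps above (a sketch only, not the actual OMPI code; the function name, the integer rounding, and the 24 GiB example node are assumptions made here):

```python
GiB = 1024 ** 3

def per_process_registration_cap(physical_bytes, num_mpi_procs_on_machine, percent=85):
    """Return M, the most registered memory (in bytes) one MPI process may use.

    Hypothetical helper: keeps `percent` of physical RAM for MPI and splits it
    evenly across the MPI processes on the machine, using integer arithmetic.
    """
    if num_mpi_procs_on_machine < 1:
        raise ValueError("need at least one MPI process on the machine")
    usable = physical_bytes * percent // 100   # 85% of physical memory
    return usable // num_mpi_procs_on_machine  # even share per process

# Example: a 24 GiB node with 1 process, then with 12 processes.
for procs in (1, 12):
    cap = per_process_registration_cap(24 * GiB, procs)
    print(f"{procs:2d} procs -> M = {cap / GiB:.2f} GiB each")
```

With one process per node, M comes out at roughly 20.4 GiB on such a node; none of this accounts for registered memory used by queue pairs.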
This is a heuristic. The idea is that we leave 15% of memory available to the OS and other OpenFabrics services running on the machine (IPoIB, subnet management, etc.). However, there is one variable OMPI doesn't count -- the registered memory consumed by a queue pair's metadata.
When you take into account the fact that OMPI creates queue pairs lazily (in an attempt to reduce registered memory consumption, which is fairly ironic here ;-) ), we could still run out of registered memory and then try to create a new QP later (e.g., the first time MPI process A sends to B). This QP could fail to be created if there is no more registered memory.
That's the type of error that I see in your log files (QP creation failure).
But with 15% of RAM left, we're greatly surprised to see this kind of error. Perhaps registering 8+GB buffers does something in the OpenFabrics registration system that we're unaware of (to make overall available registered memory deplete faster). Huh.
> Note that the same program which dies in (b) runs fine over IPoIB (-mca btl ^openib). However, the performance is very bad in this case... some 1100 sec. instead of about a minute.
Yep, that makes sense. IPoIB is quite inefficient.
> Reproducing: compile attached file and let it run on nodes with >=24GB with
> log_num_mtt : 20
> log_mtts_per_seg: 3
> (=32Gb, our default values):
> $ mpiexec ....<one proc per node> .... a.out 1080000000 1080000001
So I'm not 100% clear on what you mean here: when you set the OFED params to allow registration of more memory than you have physically, does the problem go away?
From your log messages, the warning messages were from machines with nearly 100GB RAM but only 32GB register-able. But only one of those was the same as one that showed QP creation failures. So I'm not clear which was which.
Regardless: can you pump the MTT params up to allow registering all of physical memory on those machines, and see if you get any failures?
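For reference, here is a small sketch of how the two MTT parameters translate into registerable memory. It assumes the formula given in the Open MPI FAQ, max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size, and a 4096-byte page size; the function names are made up for this example.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3

def registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Registerable memory per the FAQ formula (page size is an assumption)."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size

def min_log_num_mtt(target_bytes, log_mtts_per_seg, page_size=4096):
    """Smallest log_num_mtt that lets target_bytes be registered."""
    per_segment = (1 << log_mtts_per_seg) * page_size
    segments = -(-target_bytes // per_segment)  # ceiling division
    if segments <= 1:
        return 0
    return (segments - 1).bit_length()          # ceil(log2(segments))

# The defaults quoted in the report: log_num_mtt=20, log_mtts_per_seg=3.
print(registerable_bytes(20, 3) // MiB, "MiB")

# Value needed to cover the 98293 MiB node from the warning, at 1x and 2x.
total = 98293 * MiB
print(min_log_num_mtt(total, 3), min_log_num_mtt(2 * total, 3))
```

Under these assumptions the quoted defaults give 32768 MiB, which matches the "Registerable memory" figure in the warning, and covering all of physical memory on the 98293 MiB node would need log_num_mtt of at least 22 (23 for 2x) if log_mtts_per_seg stays at 3.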
To be clear: we're trying to determine if we should spend more effort on making OMPI work properly in low-registered-memory-available scenarios, or whether we should just emphasize "go reset your MTT parameters to allow registering all of physical memory."
> Well, we know about the need to raise the values of one of these parameters, but I wanted to let you know that your workaround for the problem is still not 100% perfect but only 99%.
Ok, good. Many thanks for your patience with all of this.
> P.S: A note about the informative warning:
> WARNING: It appears that your OpenFabrics subsystem is configured to only
> allow registering part of your physical memory.
> Registerable memory: 32768 MiB
> Total memory: 98293 MiB
> On a node with 24 GB this warning did not come up, although the max. size of registered memory (32GB) is only 1.5x of RAM, but in
> at least the 2x RAM size is recommended.
> Should this warning not come out in all cases when registered memory < 2x RAM?
You're correct -- perhaps this is bad wording in the FAQ. As far as I understand it, it's only necessary to be able to register all of physical memory, although Mellanox did recommend being able to register at least 2x physical memory.
...that being said, I have never gotten a clear explanation of what exactly those two parameters are. I.e., why you would adjust one and not the other, etc. Mellanox advised us (the OMPI developers) to adjust log_num_mtt, but I never found out why.
We'll continue to pester Mellanox to try to get a good answer as to why 2x is recommended. :-) If we get a good answer, we'll update the FAQ wording and/or the limits at which that warning is displayed.
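To make the two candidate thresholds concrete, here is a minimal sketch of the comparison being discussed. The actual check inside OMPI is not shown in this thread, so the function and its `factor` parameter are purely hypothetical: `factor=1` stands for "warn if not all of physical memory is registerable" and `factor=2` for the 2x recommendation.

```python
def should_warn(registerable_mib, total_mib, factor=1):
    """Hypothetical check: warn when registerable memory < factor * RAM."""
    return registerable_mib < factor * total_mib

# Node from the warning text: 32768 MiB registerable, 98293 MiB RAM.
print(should_warn(32768, 98293))            # warns under either threshold

# A 24 GB node (24576 MiB) with the same 32768 MiB limit.
print(should_warn(32768, 24576, factor=1))  # all of RAM is registerable
print(should_warn(32768, 24576, factor=2))  # below the 2x recommendation
```

Under the 1x reading the 24 GB node stays silent, which is consistent with what was observed; under the 2x reading it would warn.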