Open MPI User's Mailing List Archives

From: Glenn Carver (Glenn.Carver_at_[hidden])
Date: 2007-08-03 19:14:50


Hi Don,

>If the error message is about "privileged" memory, i.e. locked or

We don't actually get an error message. What we see is the system
gradually losing free memory whilst running batch jobs, to the point
where it begins swapping like mad and performance plummets (this
happens on all nodes). We are still investigating and I wouldn't
want to bother this list until we have a clearer idea of what's
going on. But oddly, when the job finishes we don't seem to get all
the memory back (a reboot fixes it). We are running Fortran codes
(not renowned for memory leaks) and haven't seen this problem on the
other systems we use, nor did we see it with ClusterTools 6, only
with CT7, which is why we currently suspect the free list growing
too large.
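
For what it's worth, the behaviour is easy to watch with plain
vmstat on each node, e.g.:

 % vmstat 5

(the 'free' column shrinks steadily and the 'sr' scan rate climbs
once a node starts paging); as root, 'echo ::memstat | mdb -k' gives
a kernel-level breakdown of where the memory has gone.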

>pinned memory, on Solaris you can increase the amount of available
>privileged memory by editing the /etc/project file on the nodes.
>
>Amount available (example of typical value is 900MB):
>% prctl -n project.max-device-locked-memory -i project default

Apologies, I'm not familiar with projects in Solaris. If I run this
command I get:
# prctl -n project.max-device-locked-memory -i project default
prctl: default: No controllable process found in task, project, or zone.

If I run it for one of the processes on the parallel job I get:
# prctl -n project.max-device-locked-memory -i pid 6553
process: 6553: ./tomcat
NAME                              PRIVILEGE   VALUE  FLAG  ACTION
project.max-device-locked-memory  privileged  217MB     -  deny

The nodes are X4100s: dual-CPU, dual-core Opterons with 3.5GB of
RAM, so each node runs 4 processes. All nodes are running Solaris
11/06 and are up to date with patches.
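
Incidentally, from the projmod(1M) man page it looks as though the
same change could be made without hand-editing /etc/project,
something like (untested here):

 # projmod -s -K "project.max-device-locked-memory=(priv,4GB,deny)" default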

>
>Edit /etc/project:
>Default line of interest :
> default:3::::
>
>Change to, for example 4GB :
> default:3::::project.max-device-locked-memory=(priv,4197152000,deny)
>
>What to set ompi_free_list_max to? By default each connection will
>post 8 recvs, 7 sends, 32 rdma writes and possibly a few internal
>control messages. Since these are pulling from the same free list I
>believe a sufficient value could be calculated as: 50 * (np - 1).
>Memory will still be consumed but this should lessen the amount of
>privileged memory required.

Thanks, I will give that a try. One question, is 'np' the no. of
processes on each node or the total processes for the job?
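
If it is the total for the job then, taking a hypothetical
16-process run as an example, that would be 50 * (16 - 1) = 750,
i.e. something like:

 % mpirun --mca btl_udapl_free_list_max 750 -np 16 ./tomcat

(16 is just an illustrative job size here; the value obviously
scales with np).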

>Memory consumption is something Sun is actively investigating. What
>size job are you running?

Each process has a SIZE of just under 800MB (RES is typically about
half, often less, never more). Four such processes per node is
roughly 3.2GB of virtual memory against 3.5GB of RAM, so there is
little headroom if the resident sets grow towards SIZE.

>
>Not sure if this is part of the issue, but another possibility: if
>the communication pattern of the MPI job is actually starving one
>connection out of memory you could try setting "--mca
>mpi_preconnect_all 1" and "--mca btl_udapl_max_eager_rdma_peers X",
>where X is equal to np. This will establish a connection between
>all processes in the job as well as create a channel for short
>messages to use rdma functionality. By establishing this channel
>to all connections before the MPI job starts up, each peer
>connection will be guaranteed some amount of privileged memory over
>which it could potentially communicate. Of course you do take the
>hit of wireup time for all connections at MPI_Init.

That's a useful tip and may apply in our case, as the code
configuration giving us trouble writes a lot of data to process 0
for disk output.
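
If I've understood the suggestion correctly, for the same
hypothetical 16-process job that would look something like:

 % mpirun --mca mpi_preconnect_all 1 \
          --mca btl_udapl_max_eager_rdma_peers 16 \
          -np 16 ./tomcat

(again, 16 here is just an example value for np).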

Thanks,
                 Glenn

>
>-DON
>
>Brian Barrett wrote:
>
>>On Aug 2, 2007, at 4:22 PM, Glenn Carver wrote:
>>
>>
>>
>>>Hopefully an easy question to answer... is it possible to get at the
>>>values of mca parameters whilst a program is running? What I had in
>>>mind was either an open-mpi function to call which would print the
>>>current values of mca parameters or a function to call for specific
>>>mca parameters. I don't want to interrupt the running of the
>>>application.
>>>
>>>Bit of background. I have a large F90 application running with
>>>OpenMPI (as Sun Clustertools 7) on Opteron CPUs with an IB network.
>>>We're seeing swap thrashing occurring on some of the nodes at times
>>>and having searched the archives and read the FAQ believe we may be
>>>seeing the problem described in:
>>>http://www.open-mpi.org/community/lists/users/2007/01/2511.php
>>>where the udapl free list is growing to a point where lockable
>>>memory runs out.
>>>
>>>Problem is, I have no feel for the kinds of numbers that
>>>"btl_udapl_free_list_max" might safely get up to? Hence the request
>>>to print mca parameter values whilst the program is running to see if
>>>we can tie in high values of this parameter to when we're seeing swap
>>>thrashing.
>>>
>>>
>>
>>Good news, the answer is easy. Bad news is, it's not the one you
>>want. btl_udapl_free_list_max is the *greatest* the list will ever
>>be allowed to grow to, not its current size. So if you don't
>>specify a value and use the default of -1, it will return -1 for
>>the life of the application, regardless of how big those free
>>lists actually get. If you specify value X, it'll return X for the
>>life of the application, as well.
>>
>>There is not a good way for a user to find out the current size of a
>>free list or the largest it got for the life of an application
>>(currently those two will always be the same, but that's another
>>story). Your best bet is to set the parameter to some value (say,
>>128 or 256) and see if that helps with the swapping.
>>
>>
>>Brian
>>
>>
>>