Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] memory per core/process
From: Reuti (reuti_at_[hidden])
Date: 2013-04-02 07:42:10


Hi,

On 02.04.2013 at 13:22, Duke Nguyen wrote:

> On 4/1/13 9:20 PM, Ralph Castain wrote:
>> It's probably the same problem - try running "mpirun -npernode 1 -tag-output ulimit -a" on the remote nodes and see what it says. I suspect you'll find that the limits aren't correct.
>
> Somehow I could not run the command you suggested:
>
> $ qsub -l nodes=4:ppn=8 -I
> qsub: waiting for job 481.biobos to start
> qsub: job 481.biobos ready
>
> $ /usr/local/bin/mpirun -npernode 1 -tag-output ulimit -a
> --------------------------------------------------------------------------
> mpirun was unable to launch the specified application as it could not find an executable:

`ulimit` is a shell builtin:

$ type ulimit
ulimit is a shell builtin

It should work with:

$ /usr/local/bin/mpirun -npernode 1 -tag-output sh -c "ulimit -a"
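
If the node limits can't be fixed right away, the same trick can be used to raise the limit in the shell that launches the actual application; a rough sketch (./my_app is only a placeholder for your binary, and raising the soft limit only works if the hard limit on the nodes permits it):

$ /usr/local/bin/mpirun -np 32 sh -c 'ulimit -s unlimited && exec ./my_app'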

-- Reuti

> Executable: ulimit
> Node: node0108.biobos
>
> while attempting to start process rank 0.
> --------------------------------------------------------------------------
> 4 total processes failed to start
>
> But anyway, I figured out the reason. Yes, the cluster nodes did not pick up the new ulimit settings (our nodes are diskless Warewulf nodes, so basically we have to update the VNFS and reboot all nodes before they run with the new settings).
>
> Thanks for all the help :)
>
> D.
>
>>
>> BTW: the "-tag-output" option marks each line of output with the rank of the process. Since all the outputs will be interleaved, this will help you identify what came from each node.
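>>
>> With -npernode 1 the tagged output of the ulimit check would look roughly like the sketch below (the prefix format and the values are only illustrative), which makes it easy to spot a node that still has the old limit:
>>
>> [1,0]<stdout>:stack size (kbytes, -s) unlimited
>> [1,1]<stdout>:stack size (kbytes, -s) 10240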
>>
>>
>> On Mar 31, 2013, at 11:30 PM, Duke Nguyen <duke.lists_at_[hidden]> wrote:
>>
>>> On 3/31/13 12:20 AM, Duke Nguyen wrote:
>>>> I should really have asked earlier. Thanks for all the help.
>>> I think I was excited too soon :). Increasing the stack size does help when I run a job on a dedicated server. Today I tried to modify the cluster (/etc/security/limits.conf, /etc/init.d/pbs_mom) and ran a different job with 4 nodes/8 cores each (nodes=4:ppn=8), but I still get the mpirun error. My ulimit now reads:
>>>
>>> $ ulimit -a
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 8271027
>>> max locked memory (kbytes, -l) unlimited
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 32768
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) unlimited
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) 8192
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>>
>>> Any other advice???
>>>
>>>> On 3/30/13 10:28 PM, Ralph Castain wrote:
>>>>> FWIW: there is an MCA param that helps with such problems:
>>>>>
>>>>> opal_set_max_sys_limits
>>>>> "Set to non-zero to automatically set any system-imposed limits to the maximum allowed",
>>>>>
>>>>> At the moment it only sets the limits on the number of open files and the maximum size of a file we can create. It would be easy enough to add the stack size, though as someone pointed out, that has some negatives as well.
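>>>>>
>>>>> For illustration only, an MCA parameter like this would typically be passed on the mpirun command line or through the environment (./my_app is just a placeholder), along the lines of:
>>>>>
>>>>> $ mpirun -mca opal_set_max_sys_limits 1 -np 32 ./my_app
>>>>> $ export OMPI_MCA_opal_set_max_sys_limits=1
>>>>>
>>>>> or set once per user in $HOME/.openmpi/mca-params.conf.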
>>>>>
>>>>>
>>>>> On Mar 30, 2013, at 7:35 AM, Gustavo Correa <gus_at_[hidden]> wrote:
>>>>>
>>>>>> On Mar 30, 2013, at 10:02 AM, Duke Nguyen wrote:
>>>>>>
>>>>>>> On 3/30/13 8:20 PM, Reuti wrote:
>>>>>>>> On 30.03.2013 at 13:26, Tim Prince wrote:
>>>>>>>>
>>>>>>>>> On 03/30/2013 06:36 AM, Duke Nguyen wrote:
>>>>>>>>>> On 3/30/13 5:22 PM, Duke Nguyen wrote:
>>>>>>>>>>> On 3/30/13 3:13 PM, Patrick Bégou wrote:
>>>>>>>>>>>> I do not know about your code but:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Did you check stack limitations? Typically Intel Fortran codes need a large amount of stack when the problem size increases.
>>>>>>>>>>>> Check ulimit -a
>>>>>>>>>>> This is the first time I've heard of stack limitations. Anyway, ulimit -a gives:
>>>>>>>>>>>
>>>>>>>>>>> $ ulimit -a
>>>>>>>>>>> core file size (blocks, -c) 0
>>>>>>>>>>> data seg size (kbytes, -d) unlimited
>>>>>>>>>>> scheduling priority (-e) 0
>>>>>>>>>>> file size (blocks, -f) unlimited
>>>>>>>>>>> pending signals (-i) 127368
>>>>>>>>>>> max locked memory (kbytes, -l) unlimited
>>>>>>>>>>> max memory size (kbytes, -m) unlimited
>>>>>>>>>>> open files (-n) 1024
>>>>>>>>>>> pipe size (512 bytes, -p) 8
>>>>>>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>>>>>>> real-time priority (-r) 0
>>>>>>>>>>> stack size (kbytes, -s) 10240
>>>>>>>>>>> cpu time (seconds, -t) unlimited
>>>>>>>>>>> max user processes (-u) 1024
>>>>>>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>>>>>>> file locks (-x) unlimited
>>>>>>>>>>>
>>>>>>>>>>> So the stack size is 10MB??? Does that cause a problem? How do I change it?
>>>>>>>>>> I ran $ ulimit -s unlimited to make the stack size unlimited, and the job ran fine!!! So it looks like the stack limit is the problem. My questions are:
>>>>>>>>>>
>>>>>>>>>> * how do I set this automatically (and permanently)?
>>>>>>>>>> * should I set all other ulimits to be unlimited?
>>>>>>>>>>
>>>>>>>>> In our environment, the only solution we found is to have mpirun run a script on each node which sets ulimit (as well as environment variables, which are more convenient to set there than on the mpirun command line) before starting the executable. We had expert recommendations against this but no other working solution. It seems unlikely that you would want to remove any limits that work at their defaults.
>>>>>>>>> An "unlimited" stack size is not really unlimited; it may still be capped by a system limit or the implementation. As we run up to 120 threads per rank and many applications have threadprivate data regions, being able to run without considering the stack limit is the exception rather than the rule.
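>>>>>>>>>
>>>>>>>>> A minimal sketch of such a node-side wrapper (the script name, the limit, and the exported variable are only examples, not a recommendation):
>>>>>>>>>
>>>>>>>>> #!/bin/sh
>>>>>>>>> # raise the soft stack limit and set per-node environment,
>>>>>>>>> # then replace this shell with the real executable
>>>>>>>>> ulimit -s unlimited
>>>>>>>>> export OMP_STACKSIZE=512M
>>>>>>>>> exec "$@"
>>>>>>>>>
>>>>>>>>> started e.g. as: mpirun -np 32 ./wrapper.sh ./my_app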
>>>>>>>> Even if I were the only user on a cluster of machines, I would have the queuing system set these limits for the job.
>>>>>>> Sorry if I don't get this correctly, but do you mean I should set this using Torque/Maui (our queuing manager) instead of on the system itself (/etc/security/limits.conf and /etc/profile.d/)?
>>>>>> Hi Duke
>>>>>>
>>>>>> We do both.
>>>>>> Set memlock and stacksize to unlimited, and increase the maximum number of
>>>>>> open files in the pbs_mom script in /etc/init.d, and do the same in /etc/security/limits.conf.
>>>>>> This may be an overzealous "belt and suspenders" policy, but it works.
>>>>>> As everybody else said, a small stacksize is a common cause of segmentation faults in
>>>>>> large codes.
>>>>>> Basically all codes that we run here have this problem, with too many
>>>>>> automatic arrays, structures, etc. in functions and subroutines.
>>>>>> But a small memlock is also trouble for OFED/InfiniBand, and the small (default)
>>>>>> maximum number of open file handles is easily hit if many programs
>>>>>> (or poorly written programs) are running on the same node.
>>>>>> The default Linux distribution limits don't seem to be tailored for HPC, I guess.
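>>>>>>
>>>>>> For illustration, the corresponding /etc/security/limits.conf entries would look something like this (the nofile value is just an example):
>>>>>>
>>>>>> *   soft   memlock   unlimited
>>>>>> *   hard   memlock   unlimited
>>>>>> *   soft   stack     unlimited
>>>>>> *   hard   stack     unlimited
>>>>>> *   soft   nofile    32768
>>>>>> *   hard   nofile    32768
>>>>>>
>>>>>> and the same settings (ulimit -l unlimited; ulimit -s unlimited; ulimit -n 32768) go near the top of the pbs_mom init script, so the daemon, and therefore the jobs it spawns, inherit them.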
>>>>>>
>>>>>> I hope this helps,
>>>>>> Gus Correa