
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] memory per core/process
From: Duke Nguyen (duke.lists_at_[hidden])
Date: 2013-04-02 12:48:41


On 4/2/13 11:03 PM, Gus Correa wrote:
> On 04/02/2013 11:40 AM, Duke Nguyen wrote:
>> On 3/30/13 8:46 PM, Patrick Bégou wrote:
>>> OK, so your problem is identified as a stack size problem. I ran into
>>> these limitations using Intel Fortran compilers on large data problems.
>>>
>>> First, it seems you can increase your stack size, as "ulimit -s
>>> unlimited" works (you didn't hit the system hard limit). The best
>>> way is to put this setting in your .bashrc file so it will work on
>>> every node.
>>> But setting it to unlimited may not really be safe. For example, a
>>> badly coded recursive function calling itself without a stop
>>> condition can request all the system memory and crash the node. So
>>> set a large but limited value; it's safer.
>>>
>>
>> Now I feel the pain you mentioned :). With -s unlimited, some of our
>> nodes now go down easily (completely) and need to be hard reset!!!
>> (whereas we never had a node go down like that before, even with
>> killed or badly coded jobs).
>>
>> Looking for a safer ulimit -s value other than "unlimited" now... :(
>>
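One way to sketch that "safer number" (an editorial sketch, not from the thread; the divisor 4 is an assumed safety factor) is to derive a bounded soft stack limit from the node's RAM and core count, for example in each user's .bashrc:

```shell
# Sketch: set a large but bounded soft stack limit instead of "unlimited",
# so one runaway recursion cannot take the whole node down.
# Assumption: a (RAM / cores) / 4 budget; the divisor 4 is arbitrary.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)  # total RAM in kB
cores=$(nproc)
stack_kb=$(( mem_kb / cores / 4 ))
ulimit -S -s "$stack_kb"   # soft limit only; the hard limit stays in place
```

On a 16GB, 8-core node this works out to a 512MB stack per process, well above the 10MB default but far from unlimited.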
>
> In my opinion this is a trade-off between who feels the pain.
> It can be you (the sysadmin) feeling the pain of having
> to power up offline nodes,
> or it could be the user feeling the pain of having
> her/his code killed by a segmentation fault due to the small
> memory available for the stack.

... in case that user is at a large institute that promises to provide
the best service and unlimited resources/unlimited *everything* to end
users. If not, the user should really think about how to make the best
use of the available resources. Unfortunately many (most?) end users don't.

> There is only so much that can be done to make everybody happy.

So true... especially since HPC resources are still a luxury here in
Vietnam, and we have a quite small (and not-so-strong) cluster.

> If you share the nodes among jobs, you could set the
> stack size limit to
> some part of the physical_memory divided by the number_of_cores,
> saving some memory for the OS etc beforehand.
> However, this can be a straitjacket for jobs that could run with
> a bit more memory, and won't because of this limit.
> If you do not share the nodes, then you could make stacksize
> closer to physical memory.

Great. Thanks for this advice Gus.
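Gus's rule of thumb can be put into numbers (a sketch; the 2GB OS reserve is an assumed figure, not one given in the thread):

```shell
# Shared node: reserve some RAM for the OS, then split the rest per core.
phys_kb=$(( 16 * 1024 * 1024 ))       # 16GB node, in kB
os_reserve_kb=$(( 2 * 1024 * 1024 ))  # assumed 2GB kept back for the OS
cores=12
per_core_kb=$(( (phys_kb - os_reserve_kb) / cores ))
echo "per-core stack budget: ${per_core_kb} kB"
```

On an unshared node the os_reserve term shrinks and the divisor becomes the number of ranks actually placed on the node.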

>
> Anyway, this is less of an OpenMPI than of a
> resource manager / queuing system conversation.

Yeah, and I have learned a lot here beyond just Open MPI :)

>
> Best,
> Gus Correa
>
>>> I'm managing a cluster and I always set a maximum value for the stack
>>> size. I also limit the memory available to each core for system
>>> stability. If a user requests only one of the 12 cores of a node, he
>>> can only access 1/12 of the node's memory. If he needs more memory he
>>> has to request 2 cores, even if he runs a sequential code. This avoids
>>> crashing other users' jobs on the same node because of memory
>>> requirements. But this is not configured on your node.
>>>
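Patrick's per-core memory cap is normally enforced by the resource manager; as a rough manual illustration (a sketch assuming cgroup v1 mounted at /sys/fs/cgroup/memory, run as root, with a hypothetical job name):

```shell
# Cap a job at 1/12 of a 16GB node, mirroring the one-core policy above.
mkdir -p /sys/fs/cgroup/memory/job_demo
echo $(( 16 * 1024 * 1024 * 1024 / 12 )) \
    > /sys/fs/cgroup/memory/job_demo/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/job_demo/tasks   # this shell and its children
```

A job in that cgroup that exceeds roughly 1.33GB gets killed or swapped rather than starving other users' jobs on the node.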
>>> Duke Nguyen wrote:
>>>> On 3/30/13 3:13 PM, Patrick Bégou wrote:
>>>>> I do not know about your code but:
>>>>>
>>>>> 1) Did you check stack limitations? Typically Intel Fortran codes
>>>>> need a large amount of stack when the problem size increases.
>>>>> Check ulimit -a.
>>>>
>>>> First time I've heard of stack limitations. Anyway, ulimit -a gives
>>>>
>>>> $ ulimit -a
>>>> core file size (blocks, -c) 0
>>>> data seg size (kbytes, -d) unlimited
>>>> scheduling priority (-e) 0
>>>> file size (blocks, -f) unlimited
>>>> pending signals (-i) 127368
>>>> max locked memory (kbytes, -l) unlimited
>>>> max memory size (kbytes, -m) unlimited
>>>> open files (-n) 1024
>>>> pipe size (512 bytes, -p) 8
>>>> POSIX message queues (bytes, -q) 819200
>>>> real-time priority (-r) 0
>>>> stack size (kbytes, -s) 10240
>>>> cpu time (seconds, -t) unlimited
>>>> max user processes (-u) 1024
>>>> virtual memory (kbytes, -v) unlimited
>>>> file locks (-x) unlimited
>>>>
>>>> So the stack size is 10MB??? Could this cause the problem? How do I
>>>> change it?
>>>>
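To answer the question above: the 10240kB figure is the per-process soft limit, and it can be raised from the shell before launching mpirun (a sketch; the 1GB value is just an example):

```shell
ulimit -S -s        # current soft stack limit (10240 kB = 10MB here)
ulimit -H -s        # hard limit: the ceiling the soft limit may be raised to
ulimit -s 1048576   # raise the soft limit to 1GB for this shell's children
ulimit -s           # verify; mpirun launched from this shell inherits it
```

To make the change persistent, the same ulimit line goes into .bashrc (or the system-wide limits.conf), as discussed earlier in the thread.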
>>>>>
>>>>> 2) Does your node use cpusets and memory limitation (like fake
>>>>> NUMA) to set the maximum amount of memory available to a job?
>>>>
>>>> I don't really understand (this is also the first time I've heard of
>>>> fake NUMA), but I am pretty sure we do not have such things. The
>>>> server I tried was a dedicated server with 2 x X5420 and 16GB of
>>>> physical memory.
>>>>
>>>>>
>>>>> Patrick
>>>>>
>>>>> Duke Nguyen wrote:
>>>>>> Hi folks,
>>>>>>
>>>>>> I am sorry if this question has been asked before, but after ten
>>>>>> days of searching/working on the system, I surrender :(. We try to
>>>>>> use mpirun to run abinit (abinit.org), which in turn reads an
>>>>>> input file to run a simulation. The command is pretty
>>>>>> simple:
>>>>>>
>>>>>> $ mpirun -np 4 /opt/apps/abinit/bin/abinit < input.files >&
>>>>>> output.log
>>>>>>
>>>>>> We ran this command on a server with two quad-core X5420s and 16GB
>>>>>> of memory. I used only 4 cores, and I guess in theory each
>>>>>> core should be able to take up to 2GB.
>>>>>>
>>>>>> In the output of the log, there is something about memory:
>>>>>>
>>>>>> P This job should need less than 717.175 Mbytes of memory.
>>>>>> Rough estimation (10% accuracy) of disk space for files :
>>>>>> WF disk file : 69.524 Mbytes ; DEN or POT disk file : 14.240 Mbytes.
>>>>>>
>>>>>> So basically it reported that the above job should not need more
>>>>>> than ~718MB per core.
>>>>>>
>>>>>> But I still have the Segmentation Fault error:
>>>>>>
>>>>>> mpirun noticed that process rank 0 with PID 16099 on node biobos
>>>>>> exited on signal 11 (Segmentation fault).
>>>>>>
>>>>>> The system already has limits up to unlimited:
>>>>>>
>>>>>> $ cat /etc/security/limits.conf | grep -v '#'
>>>>>> * soft memlock unlimited
>>>>>> * hard memlock unlimited
>>>>>>
>>>>>> I also tried to run
>>>>>>
>>>>>> $ ulimit -l unlimited
>>>>>>
>>>>>> before the mpirun command above, but it did not help at all.
>>>>>>
>>>>>> If we adjust the parameters of input.files so that the reported
>>>>>> memory per core is less than 512MB, then the job runs fine.
>>>>>>
>>>>>> Please help,
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> D.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>
>>>
>>
>