Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] memory per core/process
From: Gus Correa (gus_at_[hidden])
Date: 2013-04-02 12:03:13

On 04/02/2013 11:40 AM, Duke Nguyen wrote:
> On 3/30/13 8:46 PM, Patrick Bégou wrote:
>> Ok, so your problem is identified as a stack size problem. I ran into
>> these limitations using Intel Fortran compilers on large data problems.
>> First, it seems you can increase your stack size, since "ulimit -s
>> unlimited" works (you didn't enforce the system hard limit). The best
>> way is to put this setting in your .bashrc file so it will work on
>> every node.
>> But setting it to unlimited may not be really safe. For example, if you
>> hit a badly coded recursive function calling itself without a stop
>> condition, it can request all the system memory and crash the node. So
>> set a large but limited value; it's safer.
> Now I feel the pain you mentioned :). With -s unlimited, some of our
> nodes now go down easily (completely) and need a hard reset!!!
> (We never had a node go down like that before, even with
> killed or badly coded jobs.)
> Looking for a safer value for ulimit -s than "unlimited" now... :(

In my opinion this is a trade-off over who feels the pain.
It can be you (the sys admin) feeling the pain of having
to power up offline nodes,
or it can be the user feeling the pain of having
her/his code killed by a segmentation fault because too little memory
is available for the stack.
There is only so much that can be done to make everybody happy.
If you share the nodes among jobs, you could set the
stack size limit to
some fraction of physical_memory divided by number_of_cores,
after setting aside some memory for the OS etc.
However, this can be a straitjacket for jobs that could run with
a bit more memory, and won't because of this limit.
If you do not share the nodes, then you could make the stack size limit
closer to the physical memory.
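
As a rough illustration of the arithmetic above, here is a minimal shell sketch. The function name stack_limit_kb, the 2 GB reserve for the OS, and the 50% fraction are all hypothetical choices made for the example, not recommended settings; the 16 GB / 8 core figures are the node described earlier in this thread. The result is in kB, the unit that "ulimit -s" uses in bash.

```shell
#!/bin/sh
# Sketch only: derive a per-process stack limit (in kB) from
# physical memory and core count. Every number here is an assumption.

# usage: stack_limit_kb MEM_KB CORES RESERVE_KB PERCENT
stack_limit_kb() {
    mem_kb=$1; cores=$2; reserve_kb=$3; percent=$4
    # Set aside memory for the OS first; never go below zero.
    usable_kb=$(( mem_kb - reserve_kb ))
    [ "$usable_kb" -gt 0 ] || usable_kb=0
    # Take the chosen fraction of what is left, then split it per core.
    echo $(( usable_kb * percent / 100 / cores ))
}

# Node from this thread: 16 GB of memory, 8 cores.
# Reserve 2 GB for the OS and allow 50% of each core's share (illustrative).
limit_kb=$(stack_limit_kb $((16 * 1024 * 1024)) 8 $((2 * 1024 * 1024)) 50)
echo "suggested stack limit: ${limit_kb} kB"

# To apply it, e.g. in a shell startup file or a job prologue:
#   ulimit -s "$limit_kb"
```

With these inputs the sketch suggests 917504 kB (896 MB) per process. On a node that is not shared, the same function with CORES set to 1 gives a limit closer to the physical memory.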

Anyway, this is less an Open MPI conversation than a
resource manager / queuing system one.

Gus Correa

>> I'm managing a cluster and I always set a maximum value for the stack size.
>> I also limit the memory available to each core for system stability.
>> If a user requests only one of the 12 cores of a node, he can only
>> access 1/12 of the node's memory. If he needs more memory he has
>> to request 2 cores, even if he runs a sequential code. This avoids
>> crashing other users' jobs on the same node that have their own memory
>> requirements. But this is not configured on your node.
>> Duke Nguyen wrote:
>>> On 3/30/13 3:13 PM, Patrick Bégou wrote:
>>>> I do not know about your code but:
>>>> 1) did you check stack limitations? Typically Intel Fortran codes
>>>> need a large amount of stack when the problem size increases.
>>>> Check ulimit -a
>>> First time I heard of stack limitations. Anyway, ulimit -a gives
>>> $ ulimit -a
>>> core file size (blocks, -c) 0
>>> data seg size (kbytes, -d) unlimited
>>> scheduling priority (-e) 0
>>> file size (blocks, -f) unlimited
>>> pending signals (-i) 127368
>>> max locked memory (kbytes, -l) unlimited
>>> max memory size (kbytes, -m) unlimited
>>> open files (-n) 1024
>>> pipe size (512 bytes, -p) 8
>>> POSIX message queues (bytes, -q) 819200
>>> real-time priority (-r) 0
>>> stack size (kbytes, -s) 10240
>>> cpu time (seconds, -t) unlimited
>>> max user processes (-u) 1024
>>> virtual memory (kbytes, -v) unlimited
>>> file locks (-x) unlimited
>>> So the stack size is 10MB??? Could this be causing the problem? How do I
>>> change this?
>>>> 2) does your node use cpusets and memory limits, like fake NUMA, to
>>>> set the maximum amount of memory available for a job?
>>> I do not really understand (this is also the first time I have heard
>>> of fake NUMA), but I am pretty sure we do not have such things. The
>>> server I tried was a dedicated server with 2 x5420 and 16GB physical memory.
>>>> Patrick
>>>> Duke Nguyen wrote:
>>>>> Hi folks,
>>>>> I am sorry if this question has been asked before, but after ten
>>>>> days of searching/working on the system, I surrender :(. We are trying to
>>>>> use mpirun to run abinit, which in turn reads an
>>>>> input file to run a simulation. The command to run is pretty simple:
>>>>> $ mpirun -np 4 /opt/apps/abinit/bin/abinit < input.files >& output.log
>>>>> We ran this command on a server with two quad-core x5420 and 16GB
>>>>> of memory. I requested only 4 cores, and I guess in theory each
>>>>> core should be able to use up to 2GB.
>>>>> In the output of the log, there is something about memory:
>>>>> P This job should need less than 717.175 Mbytes of memory.
>>>>> Rough estimation (10% accuracy) of disk space for files :
>>>>> WF disk file : 69.524 Mbytes ; DEN or POT disk file : 14.240 Mbytes.
>>>>> So basically it reported that the above job should not take more
>>>>> than 718MB per core.
>>>>> But I still have the Segmentation Fault error:
>>>>> mpirun noticed that process rank 0 with PID 16099 on node biobos
>>>>> exited on signal 11 (Segmentation fault).
>>>>> The system's memlock limits are already set to unlimited:
>>>>> $ cat /etc/security/limits.conf | grep -v '#'
>>>>> * soft memlock unlimited
>>>>> * hard memlock unlimited
>>>>> I also tried to run
>>>>> $ ulimit -l unlimited
>>>>> before the mpirun command above, but it did not help at all.
>>>>> If we adjust the parameters in input.files so that the reported
>>>>> memory per core is less than 512MB, then the job runs fine.
>>>>> Please help,
>>>>> Thanks,
>>>>> D.
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]