
Open MPI Development Mailing List Archives


From: Pak Lui (Pak.Lui_at_[hidden])
Date: 2007-06-25 09:44:54

sadfub_at_[hidden] wrote:
> Sorry for the late reply, but I haven't had access to the machine over the weekend.
>> I don't really know what this means. People have explained "loose"
>> vs. "tight" integration to me before, but since I'm not an SGE user,
>> the definitions always fall away.
> I *assume* loosely coupled jobs are just jobs where SGE finds some
> nodes to process them and from then on doesn't care about anything
> related to the jobs. In contrast, for tightly coupled jobs SGE
> keeps track of sub-processes that may spawn from the job, terminates
> them too in case of a failure, and takes care of the specified
> resources.
>> Based on your prior e-mail, it looks like you are always invoking
>> "ulimit" via "pdsh", even under SGE jobs. This is incorrect.
> why?
>> Can't you just submit an SGE job script that runs "ulimit"?
> #!/bin/csh -f
> #$ -N MPI_Job
> #$ -pe mpi 4
> hostname && ulimit -a
> ATM I'm quite confused: I want to use the C shell, but ulimit is a
> bash builtin; csh uses "limit" instead... hmm... and SGE obviously
> runs the script with bash, despite my request for csh in the first
> line. But if I just use #!/bin/bash I get the same limits:
> -sh-3.00$ cat MPI_Job.o112116
> node02
> core file size (blocks, -c) unlimited
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 1024
> max locked memory (kbytes, -l) 32
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) unlimited
> cpu time (seconds, -t) unlimited
> max user processes (-u) 139264
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
> oops => 32 kbytes... So this isn't OMPI's fault.
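
A side note on the csh vs. bash confusion: "ulimit" is a Bourne-shell
builtin, while csh and tcsh use "limit". Both report the same kernel
rlimits under different names, so either works for checking. A quick
bash sketch:

```shell
#!/bin/bash
# bash's "ulimit" and csh's "limit" query the same kernel rlimits.
ulimit -l        # max locked memory in kbytes, or "unlimited"
ulimit -a        # all limits, as in the job output above
# The csh/tcsh equivalents (not valid in bash) would be:
#   limit memorylocked
#   limit
```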

This looks like sge_execd isn't able to pick up the correct system
defaults from limits.conf after you applied the change. Maybe
you will need to restart the daemon?
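
If you can restart it, the important part is to raise the limits in
the shell (or init script) that launches the daemon, so that every
child inherits them -- the same trick Jeff mentions below for his
SLURM daemons. Roughly like this; the init script path is a guess for
a typical install, so adjust for your $SGE_ROOT and cell name:

```shell
# Raise the hard and soft locked-memory limits first, then restart
# sge_execd so every job it spawns inherits them:
ulimit -Hl unlimited
ulimit -l unlimited
$SGE_ROOT/default/common/sgeexecd stop    # path is an assumption
$SGE_ROOT/default/common/sgeexecd start
```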

>>>> What are the limits of the user that launches the SGE daemons? I.e.,
>>>> did the SGE daemons get started with proper "unlimited" limits? If
>>>> not, that could hamper SGE's ability to set the limits that you told
>>> The limits in /etc/security/limits.conf apply to all users (using a
>>> '*'), hence the SGE processes and daemons shouldn't have any limits.
>> Not really. limits.conf is not universally applied; it's a PAM
>> entity. So for daemons that start via /etc/init.d scripts (or
>> whatever the equivalent is on your system), PAM limits are not
>> necessarily applied. For example, I had to manually insert a "ulimit
>> -Hl unlimited" in the startup script for my SLURM daemons.
> Hmm, ATM there are some important jobs in the queue (started some MONTHS
> ago) so I cannot restart the daemon. Is there any way other than a
> restart (with proper limits) to ensure the limits of a process?

Would setting the limit in ~/.cshrc work around this? If neither that
nor restarting sge_execd (after you change limits.conf) sets the
correct limit for you, then I believe something is wrong inside
sge_execd that keeps it from setting the limits correctly.
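
The ~/.cshrc variant would be something like this (csh syntax; note
that it only helps if the shell SGE starts for the job actually
sources it, and a non-root shell cannot raise a hard limit that the
daemon has already lowered):

```shell
# in ~/.cshrc -- csh spells it "limit", not "ulimit":
limit memorylocked unlimited
```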

> thanks for your great help :)

- Pak Lui