Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE
From: Joshua Baker-LePain (jlb17_at_[hidden])
Date: 2012-03-15 13:14:54


On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote:

> PS: In your example you also had the case 2 slots in the low priority
> queue, what is the actual setup in your cluster?

Our actual setup is:

  o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE
    projects) limited by RQS to a number of slots equal to their "share" of
    the cluster, seq_no=0, priority=0.

  o long.q, slots=numprocs, load_thresholds=np_load_avg=0.9, seq_no=1,
    priority=19.

  o short.q, slots=numprocs, load_thresholds=np_load_avg=1.25, users
    limited by RQS to 200 slots, runtime limited to 30 minutes, seq_no=2,
    priority=10.

Users are instructed not to select a queue when submitting jobs. The
theory is that even if non-contributing users have filled the cluster with
long.q jobs, contributing users still have instant access to "their"
lab.q slots: their jobs oversubscribe those nodes and run at a higher
priority than the long.q jobs already there. Conversely, long.q jobs won't
start on nodes full of lab.q jobs. short.q is for quick, high-priority jobs
regardless of cluster status (the main use case being processing MRI data
into images while a patient is physically in the scanner).
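
To make the interplay of the thresholds concrete, here is a rough Python
sketch of the setup above. It is a toy model, not how the SGE scheduler is
implemented: the Queue class and the open_queues() helper are made up for
illustration, it assumes queues are considered in seq_no order, and it
assumes a queue stops accepting new jobs once the node's np_load_avg
reaches the queue's threshold. RQS limits and the short.q runtime limit
are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Queue:
    """Hypothetical stand-in for one SGE queue from the list above."""
    name: str
    load_threshold: float  # np_load_avg value from load_thresholds
    seq_no: int
    priority: int          # assumed to act like a nice value (lower = higher priority)


# Values copied from the queue list in this message.
QUEUES = [
    Queue("lab.q", load_threshold=1.5, seq_no=0, priority=0),
    Queue("long.q", load_threshold=0.9, seq_no=1, priority=19),
    Queue("short.q", load_threshold=1.25, seq_no=2, priority=10),
]


def open_queues(np_load_avg):
    """Return the queues still accepting jobs on a node, in seq_no order.

    Assumption: a queue is closed to new jobs once the node's normalized
    load average reaches that queue's threshold.
    """
    ordered = sorted(QUEUES, key=lambda q: q.seq_no)
    return [q.name for q in ordered if np_load_avg < q.load_threshold]


if __name__ == "__main__":
    for load in (0.5, 1.0, 1.3, 1.6):
        print(load, open_queues(load))
```

In this model a node running one job per core (np_load_avg around 1.0) no
longer takes long.q jobs but still takes lab.q and short.q jobs, which is
the behavior described above.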

The truth is our cluster is primarily used for, and thus SGE is tuned for,
large numbers of serial jobs. We do have *some* folks running parallel
code, and it *is* starting to get to the point where I need to reconfigure
things to make that part work better.

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF