Consider a job that will run with 28 processes. The user submits it with
$ qsub -l nodes=4:ppn=7 ...
which reserves 7 cores on each of (in this case) x3550x014, x3550x015,
x3550x016 and x3550x020. Torque generates a file (named by $PBS_NODEFILE)
which lists each node 7 times.
The mpirun command given within the batch script is:
$ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>
This is what I would refer to as 7+7+7+7, and it runs fine.
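
As a quick sanity check on that layout, the per-node slot count can be read back from the node file. What follows is a minimal Python sketch, not part of Torque or Open MPI. The function name `slots_per_node` and the sample host list are illustrative only, and it assumes the file holds one hostname per line, repeated once per reserved core, as described above.

```python
from collections import Counter


def slots_per_node(lines):
    """Count how many times each hostname appears in node-file lines."""
    hosts = [line.strip() for line in lines if line.strip()]
    return Counter(hosts)


# Illustrative stand-in for the contents of the file named by $PBS_NODEFILE
# after "qsub -l nodes=4:ppn=7" (hostnames taken from the example above).
sample = []
for host in ["x3550x014", "x3550x015", "x3550x016", "x3550x020"]:
    sample.extend([host] * 7)

counts = slots_per_node(sample)
total = sum(counts.values())
layout = "+".join(str(n) for n in counts.values())
print(layout, "=", total)  # prints: 7+7+7+7 = 28
```

Inside a batch script the same function could be fed from the real file, for example `slots_per_node(open(os.environ["PBS_NODEFILE"]))`, and the total compared against the value passed to `-np`.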
The problem occurs if, for instance, a 24-core job is attempted. qsub
gets nodes=3:ppn=8 and mpirun has -np 24. The job is now running on
three nodes, using all eight cores on each node: 8+8+8. This sort of
job will eventually hang and has to be killed off.
Cores  Nodes  Ppn   Result
-----  -----  ----  ------
8      1      any   works
8      >1     1-7   works
8      >1     8     hangs
16     1      any   works
16     >1     1-15  works
16     >1     16    hangs
We have also tried test jobs on 8+7 (or 7+8) with inconclusive results.
Some of the live jobs run for a month or more, and cut-down versions do
not model them well.
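
The observations so far can be restated as a small rule. The Python sketch below only encodes what has been reported in this message (single node works, multi-node with every node full hangs, mixed layouts such as 8+7 are inconclusive). It is not a model of Open MPI's behaviour, and the function name `observed_outcome` is made up for illustration.

```python
def observed_outcome(procs_per_node, cores_per_node):
    """Return the outcome reported for a layout such as [8, 8, 8].

    This restates the observations above; it explains nothing about the cause.
    """
    full = [n == cores_per_node for n in procs_per_node]
    if len(procs_per_node) == 1 or not any(full):
        return "works"
    if all(full):
        return "hangs"
    return "inconclusive"  # mixed layouts such as 8+7 or 7+8


for layout in ([7, 7, 7, 7], [8, 8, 8], [8, 7], [8]):
    label = "+".join(str(n) for n in layout)
    print(label, "->", observed_outcome(layout, cores_per_node=8))
```

Writing it this way makes the gap in the data visible: every confirmed hang involves more than one node with all cores in use on each, and the mixed case is the one that still needs a reliable test.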
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Ralph Castain
Sent: 13 April 2011 15:34
To: Open MPI Users
Subject: Re: [OMPI users] Over committing?
On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
> The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
> Jobs are submitted by Torque/MOAB. When run with up to np=8 there is
> good performance. Attempting to run with more processors brings
> problems, specifically if any one node of a group of nodes has all 8
> cores in use the job hangs. For instance running with 14 cores (7+7)
> is fine, but running with 16 (8+8) hangs.
> From the FAQs I note the issues of over committing and aggressive
> scheduling. Is it possible for mpirun (or orted on the remote nodes)
> to be blocked from progressing by a fully committed node? We have
> x3755-m2 machines with 16 cores, and we have detected a similar issue
> with 16+16.
I'm not entirely sure I understand your notation, but we have never seen
an issue when running with fully loaded nodes (i.e., where the number of
MPI procs on the node = the number of cores).
What version of OMPI are you using? Are you binding the procs?