Afraid I have no idea - we regularly run on Torque machines with the nodes fully populated. While most runs are only for a few hours, some runs go for days.
How was OMPI configured? What OS version?
On Apr 13, 2011, at 9:09 AM, Rushton Martin wrote:
> Version 1.3.2
> Consider a job that will run with 28 processes. The user submits it with:
> $ qsub -l nodes=4:ppn=7 ...
> which reserves 7 cores on each of (in this case) x3550x014, x3550x015,
> x3550x016 and x3550x020. Torque generates a file (PBS_NODEFILE) which lists
> each node 7 times.
> The mpirun command given within the batch script is:
> $ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>
> This is what I would refer to as 7+7+7+7, and it runs fine.
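For reference, the per-node layout that Torque hands to mpirun can be checked from inside the batch script. This is only a minimal sketch, assuming a POSIX shell with sort, uniq and awk on the compute node; the function name summarize_nodefile is hypothetical and is not part of Torque or Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: print "<hostname> <slots>" for each node in a
# Torque-style nodefile, where a host appears once per reserved core.
summarize_nodefile() {
    # $1: path to the nodefile (inside a job this would be "$PBS_NODEFILE")
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a batch script one might call, just before mpirun:
#   summarize_nodefile "$PBS_NODEFILE"
```

For the nodes=4:ppn=7 job described above, this should report 7 slots against each of the four hosts, which makes it easy to confirm that a run really is 7+7+7+7 rather than some other split.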
> The problem occurs if, for instance, a 24 core job is attempted. qsub
> gets nodes=3:ppn=8 and mpirun has -np 24. The job is now running on
> three nodes, using all eight cores on each node - 8+8+8. This sort of
> job will eventually hang and has to be killed off.
> Cores/node  Nodes  Ppn   Result
> ----------  -----  ----  ------
> 8           1      any   works
> 8           >1     1-7   works
> 8           >1     8     hangs
> 16          1      any   works
> 16          >1     1-15  works
> 16          >1     16    hangs
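The pattern in the table can be written down as a single rule: a job hangs only when it spans more than one node and ppn equals the number of cores per node. The sketch below encodes just that observation, reading the first column as cores per node; the function name observed_result is hypothetical, and the rule describes the reported symptoms, not their cause.

```shell
#!/bin/sh
# Hypothetical helper encoding only the observations in the table;
# it does not diagnose why the hang occurs.
observed_result() {
    # $1: cores per node, $2: number of nodes, $3: ppn requested
    cores=$1
    nodes=$2
    ppn=$3
    if [ "$nodes" -gt 1 ] && [ "$ppn" -eq "$cores" ]; then
        echo hangs
    else
        echo works
    fi
}

# Example: the 24-process job on three 8-core nodes (8+8+8)
#   observed_result 8 3 8
```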
> We have also tried test jobs on 8+7 (or 7+8) with inconclusive results.
> Some of the live jobs run for a month or more and cut down versions do
> not model well.
> Martin Rushton
> HPC System Manager, Weapons Technologies
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Ralph Castain
> Sent: 13 April 2011 15:34
> To: Open MPI Users
> Subject: Re: [OMPI users] Over committing?
> On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
>> The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
>> Jobs are submitted by Torque/MOAB. When run with up to np=8 there is
>> good performance. Attempting to run with more processors brings
>> problems, specifically if any one node of a group of nodes has all 8
>> cores in use the job hangs. For instance running with 14 cores (7+7)
>> is fine, but running with 16 (8+8) hangs.
>> From the FAQs I note the issues of over committing and aggressive
>> scheduling. Is it possible for mpirun (or orted on the remote nodes)
>> to be blocked from progressing by a fully committed node? We also have
>> x3755-m2 machines with 16 cores, and we have detected a similar issue
>> with 16+16.
> I'm not entirely sure I understand your notation, but we have never seen
> an issue when running with fully loaded nodes (i.e., where the number of
> MPI procs on the node = the number of cores).
> What version of OMPI are you using? Are you binding the procs?