Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Over committing?
From: Rushton Martin (JMRUSHTON_at_[hidden])
Date: 2011-04-13 12:04:23


Inline

-----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
> Sent: 13 April 2011 16:52
> To: Open MPI Users
> Subject: Re: [OMPI users] Over committing?
>
> Martin,
>
> We have seen similar behavior when using certain codes. CodeA can run
> at ppn=8 with no problem, but CodeB will run much more slowly (or hang)
> with ppn=8; instead we use ppn=7 for CodeB.

That is just what we have to do, but it does mean we are wasting 12.5%
of our CPU resource, something I'd rather not do permanently.

> This becomes complicated when we run CodeA and CodeB together (coupled
> simulations). It requires a bit of fancy language in the Torque
> script, but we are able to get these coupled jobs to run successfully.
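For reference, the sort of "fancy language" involved might look like the sketch below: a single Torque reservation whose node list is split between the two codes. The executable names, the 14/14 split, and the file names are hypothetical; the real coupled script will differ.

```shell
#!/bin/bash
# Hypothetical coupled job: reserve 7 cores per node on 4 nodes,
# then divide the Torque-generated node list between the two codes.
#PBS -l nodes=4:ppn=7

cd $PBS_O_WORKDIR

# PBS_NODEFILE lists each node once per reserved core (7 times here,
# 28 lines total). Give the first two nodes' slots to CodeA and the
# remaining two nodes' slots to CodeB.
head -n 14 $PBS_NODEFILE > codeA.hosts
tail -n 14 $PBS_NODEFILE > codeB.hosts

# Launch both codes side by side and wait for both to finish.
mpirun -np 14 -machinefile codeA.hosts ./codeA &
mpirun -np 14 -machinefile codeB.hosts ./codeB &
wait
```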

> Question: do you see this "ppn=8 hanging behavior" on every parallel
> code you run, or only on specific applications? Do you see it with and
> without Torque? Can you try running ppn=7 and ppn=8 on a simple MPI
> code?

We only run one major parallel code; the others run as single processes,
so they never show the problem. When (if) the load comes down I can try
simpler jobs with ppn=8, but the hang often occurs a day or so into the
run, and I don't want to stop productive work with non-productive cycle
stealers! It is worth noting that the main application is very
CPU-intensive Fortran code, so most of the memory is static; indeed,
most of it is configured as one humongous one-dimensional array. Very
old school, but what the heck, it works well when it isn't hung.
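When a test slot does open up, a throwaway job along these lines would separate the scheduler/MPI layer from the application. The script is a hypothetical sketch: `mpi_test` stands in for any trivial MPI program that runs long enough to reproduce the day-scale hang (e.g. a compute loop with periodic MPI_Allreduce calls).

```shell
#!/bin/bash
# Hypothetical Torque script: fully populate two 8-core nodes with a
# trivial MPI program. Resubmit with nodes=2:ppn=7 and -np 14 to
# compare against the non-hanging configuration.
#PBS -l nodes=2:ppn=8
#PBS -l walltime=48:00:00

cd $PBS_O_WORKDIR

# Same launch line the production jobs use, minus the real application.
mpirun -np 16 -machinefile $PBS_NODEFILE ./mpi_test
```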

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Rushton Martin
Sent: Wednesday, April 13, 2011 11:29
To: Open MPI Users
Subject: Re: [OMPI users] Over committing?

I'm afraid I can't comment on how OMPI was configured, "as supplied by
the suppliers"! The users experiencing these problems use the Intel
bindings, loaded via the modules command. We are running CentOS 5.3.

Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton_at_[hidden]
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

Please consider the environment before printing this email.
-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Ralph Castain
Sent: 13 April 2011 16:21
To: Open MPI Users
Subject: Re: [OMPI users] Over committing?

Afraid I have no idea - we regularly run on Torque machines with the
nodes fully populated. While most runs are only for a few hours, some
runs go for days.

How was OMPI configured? What OS version?
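A quick way to answer both questions from the cluster itself is with the standard Open MPI and RHEL-family diagnostic commands (nothing here is site-specific):

```shell
# Build configuration of this Open MPI installation; --all includes
# the full configure command line.
ompi_info
ompi_info --all

# Open MPI version as seen by the launcher:
mpirun --version

# OS release and kernel:
cat /etc/redhat-release
uname -r
```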

On Apr 13, 2011, at 9:09 AM, Rushton Martin wrote:

> Version 1.3.2
>
> Consider a job that will run with 28 processes. The user submits it
> with:
>
> $ qsub -l nodes=4:ppn=7 ...
>
> which reserves 7 cores on (in this case) each of x3550x014, x3550x015,
> x3550x016 and x3550x020. Torque generates a file (PBS_NODEFILE) which
> lists each node 7 times.
>
> The mpirun command given within the batch script is:
>
> $ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>
>
> This is what I would refer to as 7+7+7+7, and it runs fine.
>
> The problem occurs if, for instance, a 24-core job is attempted. qsub
> gets nodes=3:ppn=8 and mpirun has -np 24. The job is now running on
> three nodes, using all eight cores on each node - 8+8+8. This sort of
> job will eventually hang and has to be killed off.
>
> Cores  Nodes  Ppn   Result
> -----  -----  ----  ------
>   8      1    any   works
>   8     >1    1-7   works
>   8     >1    8     hangs
>  16      1    any   works
>  16     >1    1-15  works
>  16     >1    16    hangs
>
> We have also tried test jobs on 8+7 (or 7+8), with inconclusive
> results. Some of the live jobs run for a month or more, and cut-down
> versions do not model well.
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Ralph Castain
> Sent: 13 April 2011 15:34
> To: Open MPI Users
> Subject: Re: [OMPI users] Over committing?
>
>
> On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
>
>> The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
>> Jobs are submitted by Torque/MOAB. When run with up to np=8 there is
>> good performance. Attempting to run with more processors brings
>> problems, specifically if any one node of a group of nodes has all 8
>> cores in use the job hangs. For instance running with 14 cores (7+7)
>> is fine, but running with 16 (8+8) hangs.
>>
>> From the FAQs I note the issues of over committing and aggressive
>> scheduling. Is it possible for mpirun (or orted on the remote nodes)
>> to be blocked from progressing by a fully committed node? We have a
>> few x3755-m2 machines with 16 cores, and we have detected a similar
>> issue with 16+16.
>
> I'm not entirely sure I understand your notation, but we have never
> seen an issue when running with fully loaded nodes (i.e., where the
> number of MPI procs on the node = the number of cores).
>
> What version of OMPI are you using? Are you binding the procs?
>
> This email and any attachments to it may be confidential and are
> intended solely for the use of the individual to whom it is addressed.
> If you are not the intended recipient of this email, you must neither
> take any action based upon its contents, nor copy or show it to
> anyone. Please contact the sender if you believe you have received
> this email in error. QinetiQ may monitor email traffic data and also
> the content of email for the purposes of security. QinetiQ Limited
> (Registered in England & Wales: Company Number: 3796233) Registered
> office: Cody Technology Park, Ively Road, Farnborough, Hampshire,
> GU14 0LX http://www.qinetiq.com.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

The QinetiQ e-mail privacy policy and company information is detailed
elsewhere in the body of this email.
