Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Over committing?
From: Rushton Martin (JMRUSHTON_at_[hidden])
Date: 2011-04-14 04:46:00


I forwarded your question to the code custodian and received the
following reply (GRIM is the major code, the one which shows the
problem): "I've not tried the debugger but GRIM does have a number of
mpi_barrier calls in it so I would think we are safe there. There is of
course a performance downside with an over-use of barriers! As mentioned
in the e-trail."

Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrushton_at_[hidden]
www.QinetiQ.com
QinetiQ - Delivering customer-focused solutions

Please consider the environment before printing this email.
-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
Behalf Of Ralph Castain
Sent: 14 April 2011 04:55
To: Open MPI Users
Subject: Re: [OMPI users] Over committing?

Have you folks used a debugger such as TotalView or padb to look at
these stalls?

I ask because we discovered a long time ago that MPI collectives can
"hang" in the scenario you describe. It is caused by one rank falling
behind and then never catching up due to resource allocations, i.e.,
once a rank falls behind because its processor is being used by something
else, it never catches up.

The code that causes this is generally a loop around a collective such
as Allreduce. The solution was to inject a "barrier" operation in the
loop periodically, thus ensuring that all ranks had an opportunity to
catch up.
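
As a rough illustration of the "barrier every N collectives" idea described above, here is a minimal Python sketch. It does not call MPI: `collective` and `barrier` are hypothetical stand-ins for whatever the MPI binding in use provides (for example an allreduce and a barrier on the same communicator), and the class name `PeriodicBarrier` is invented for this example.

```python
from typing import Any, Callable


class PeriodicBarrier:
    """Wrap a collective so that a barrier runs after every `every_n` calls.

    `collective` and `barrier` are placeholders supplied by the caller;
    nothing in this class talks to MPI directly.
    """

    def __init__(self, collective: Callable[..., Any],
                 barrier: Callable[[], None], every_n: int) -> None:
        if every_n < 1:
            raise ValueError("every_n must be a positive integer")
        self._collective = collective
        self._barrier = barrier
        self._every_n = every_n
        self._calls_since_barrier = 0

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        result = self._collective(*args, **kwargs)
        self._calls_since_barrier += 1
        if self._calls_since_barrier == self._every_n:
            # Every rank waits here, giving a rank that fell behind
            # a chance to catch up before the loop continues.
            self._calls_since_barrier = 0
            self._barrier()
        return result


# Usage with stand-ins: `sum` plays the reduction and the lambda simply
# records each barrier instead of synchronising real ranks.
barrier_log = []
allreduce = PeriodicBarrier(sum, lambda: barrier_log.append("barrier"), every_n=3)
totals = [allreduce([step, step + 1]) for step in range(10)]
```

A smaller `every_n` gives lagging ranks more chances to catch up at the cost of more synchronisation, which is the trade-off mentioned in this thread.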

There is an MCA param you can set that will inject the barrier - it
specifies to inject it every N collective operations (either before or
after the Nth op):

-mca coll_sync_barrier_before N

or

-mca coll_sync_barrier_after N

It'll slow the job down a little, depending upon how often you inject
the barrier. But it did allow us to run jobs reliably to completion when
the code involved such issues.
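
For anyone wanting to try this from a batch script, the following Python sketch assembles an mpirun argument list that carries one of the two parameters above. It is only an illustration: the helper name `build_mpirun_command`, the default nodefile path and the barrier interval used in the usage lines are assumptions, and it presumes the installed Open MPI build actually provides the coll_sync parameters. It takes the process count from the number of lines in a Torque-style nodefile, which lists each node once per reserved core.

```python
from typing import Iterable, List


def build_mpirun_command(nodefile_lines: Iterable[str], executable: str,
                         barrier_every: int, when: str = "before",
                         nodefile_path: str = "nodefile") -> List[str]:
    """Return an mpirun argument list with a coll_sync barrier parameter.

    Hypothetical helper: it only builds the argument list and runs nothing.
    """
    if when not in ("before", "after"):
        raise ValueError("when must be 'before' or 'after'")
    if barrier_every < 1:
        raise ValueError("barrier_every must be a positive integer")
    # The nodefile has one line per reserved core, so the number of
    # non-empty lines is the number of processes to start.
    slots = sum(1 for line in nodefile_lines if line.strip())
    return [
        "mpirun", "-np", str(slots),
        "-machinefile", nodefile_path,
        "-mca", f"coll_sync_barrier_{when}", str(barrier_every),
        executable,
    ]


# Usage: three nodes with eight entries each, as in the 8+8+8 case.
# The interval of 1000 is an arbitrary illustrative value, not a recommendation.
lines = ["x3550x014"] * 8 + ["x3550x015"] * 8 + ["x3550x016"] * 8
command = build_mpirun_command(lines, "./executable", 1000)
```

The resulting list can be passed to `subprocess.run` or joined into a single line for the batch script.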

On Apr 13, 2011, at 10:07 AM, Rushton Martin wrote:

> The 16 cores refers to x3755-m2s. We have a mix of 3550s and 3755s in
> the cluster.
>
> It could be memory, but I think not. The jobs are well within memory
> capacity, and the memory is mainly static. If a node did run out of
> memory, these jobs would be the first candidates to be killed. Larger
> jobs run on the 3755s, which have local disks for paging as well as
> more memory.
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Reuti
> Sent: 13 April 2011 16:53
> To: Open MPI Users
> Subject: Re: [OMPI users] Over committing?
>
> On 13.04.2011 at 17:09, Rushton Martin wrote:
>
>> Version 1.3.2
>>
>> Consider a job that will run with 28 processes. The user submits it
>> with:
>>
>> $ qsub -l nodes=4:ppn=7 ...
>>
>> which reserves 7 cores on (in this case) each of x3550x014, x3550x015,
>> x3550x016 and x3550x020. Torque generates a file (PBS_NODEFILE) which
>> lists each node 7 times.
>>
>> The mpirun command given within the batch script is:
>>
>> $ mpirun -np 28 -machinefile $PBS_NODEFILE <executable>
>>
>> This is what I would refer to as 7+7+7+7, and it runs fine.
>>
>> The problem occurs if, for instance, a 24 core job is attempted. qsub
>> gets nodes=3:ppn=8 and mpirun has -np 24. The job is now running on
>> three nodes, using all eight cores on each node - 8+8+8. This sort of
>> job will eventually hang and has to be killed off.
>>
>> Cores  Nodes  Ppn   Result
>> -----  -----  ----  ------
>> 8      1      any   works
>> 8      >1     1-7   works
>> 8      >1     8     hangs
>> 16     1      any   works
>> 16     >1     1-15  works
>> 16     >1     16    hangs
>
> How many cores do you have in each system? Looks like 8 is the maximum
> IBM offers from their datasheet, and still you can request 16 per
> node?
>
> Could it be a memory problem?
>
> -- Reuti
>
>
>> We have also tried test jobs on 8+7 (or 7+8) with inconclusive results.
>> Some of the live jobs run for a month or more and cut down versions do
>> not model well.
>>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
>> On Behalf Of Ralph Castain
>> Sent: 13 April 2011 15:34
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Over committing?
>>
>>
>> On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
>>
>>> The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
>>> Jobs are submitted by Torque/MOAB. When run with up to np=8 there is
>>> good performance. Attempting to run with more processors brings
>>> problems, specifically if any one node of a group of nodes has all 8
>>> cores in use the job hangs. For instance running with 14 cores (7+7)
>>> is fine, but running with 16 (8+8) hangs.
>>>
>>> From the FAQs I note the issues of over committing and aggressive
>>> scheduling. Is it possible for mpirun (or orted on the remote nodes)
>>> to be blocked from progressing by a fully committed node? We have a
>>> few x3755-m2 machines with 16 cores, and we have detected a similar
>>> issue with 16+16.
>>
>> I'm not entirely sure I understand your notation, but we have never
>> seen an issue when running with fully loaded nodes (i.e., where the
>> number of MPI procs on the node = the number of cores).
>>
>> What version of OMPI are you using? Are you binding the procs?
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users