On Apr 13, 2011, at 8:13 AM, Rushton Martin wrote:
> The bulk of our compute nodes are 8 cores (twin 4-core IBM x3550-m2).
> Jobs are submitted by Torque/MOAB. When run with up to np=8 there is
> good performance. Attempting to run with more processors brings
> problems, specifically if any one node of a group of nodes has all 8
> cores in use the job hangs. For instance running with 14 cores (7+7) is
> fine, but running with 16 (8+8) hangs.
> From the FAQs I note the issues of over committing and aggressive
> scheduling. Is it possible for mpirun (or orted on the remote nodes) to
> be blocked from progressing by a fully committed node? We have a few
> x3755-m2 machines with 16 cores, and we have detected a similar issue
> with 16+16.
I'm not entirely sure I understand your notation, but we have never seen an issue when running with fully loaded nodes (i.e., where the number of MPI procs on a node equals the number of cores on that node).
What version of OMPI are you using? Are you binding the procs?
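To pin down the notation, here is a minimal sketch, assuming "7+7" means 14 processes split as 7 on each of two nodes and "8+8" means 16 processes split as 8 on each. The helper names `split_evenly` and `fully_loaded` are hypothetical and exist only for illustration; this is not how Open MPI maps processes, just arithmetic showing which layouts leave no core free on a node.

```python
def split_evenly(np_total, n_nodes):
    """Split np_total processes across n_nodes as evenly as possible.

    Hypothetical helper: models the "7+7" / "8+8" notation only,
    not Open MPI's actual mapping policy.
    """
    base, extra = divmod(np_total, n_nodes)
    # The first `extra` nodes receive one additional process each.
    return [base + (1 if i < extra else 0) for i in range(n_nodes)]


def fully_loaded(counts, cores_per_node):
    """Return indices of nodes whose process count uses every core."""
    return [i for i, c in enumerate(counts) if c >= cores_per_node]


if __name__ == "__main__":
    cores = 8  # 8-core nodes, as described in the original report
    for np_total in (14, 16):
        layout = split_evenly(np_total, 2)
        print(np_total, layout, "fully loaded nodes:", fully_loaded(layout, cores))
```

Under that reading, the 14-process case leaves one core idle on each node, while the 16-process case leaves none, which is the configuration reported to hang.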
> Martin Rushton
> HPC System Manager, Weapons Technologies