Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Can run OpenMPI testcode on 86 or fewer slots in cluster, but nothing more than that
From: Lane, William (William.Lane_at_[hidden])
Date: 2011-07-26 15:19:01


Ralph,

I can successfully run the MPI testcode via OpenMPI 1.3.3 on fewer than 87 slots with both the btl_tcp_if_exclude and btl_tcp_if_include switches passed to mpirun.

SGE always allocates the qsub jobs from the 24-slot nodes first, up to the 96 slots that these 4 nodes have available (on largeMem.q). The rest of the 602 slots are allocated from 2-slot nodes (all.q). Requests of up to 96 slots are therefore serviced entirely by the largeMem.q nodes; anything over 96 slots is serviced first by the largeMem.q nodes and then by the all.q nodes.
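
To make the allocation order described above concrete, here is a minimal Python sketch. It is only a model of the behaviour as stated in this message, not of SGE itself: the host names are invented, the function name `sge_allocate` is hypothetical, and it assumes each node is filled completely before the next one is used.

```python
from collections import OrderedDict


def sge_allocate(requested, large_nodes=4, large_slots=24, small_slots=2):
    """Return an ordered {host: slots} allocation for `requested` slots.

    Assumption (from the message): the four 24-slot largeMem.q nodes are
    filled first, then as many 2-slot all.q nodes as needed.
    Host names are made up for illustration; the number of 2-slot nodes
    is not capped in this sketch.
    """
    allocation = OrderedDict()
    remaining = requested

    # Fill the large-memory nodes first.
    for i in range(large_nodes):
        if remaining <= 0:
            break
        take = min(large_slots, remaining)
        allocation[f"largemem-{i}"] = take
        remaining -= take

    # Spill whatever is left onto 2-slot nodes.
    i = 0
    while remaining > 0:
        take = min(small_slots, remaining)
        allocation[f"compute-{i}"] = take
        remaining -= take
        i += 1

    return allocation


if __name__ == "__main__":
    for n in (86, 88, 100):
        alloc = sge_allocate(n)
        print(n, "slots ->", len(alloc), "hosts:", list(alloc.values()))
```

Under these assumptions, requests for 86 and for 88 slots both land entirely on the four 24-slot nodes, so the set of hosts does not change at the point where the job starts failing.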

Here's the PE that I'm using:

mpich PE (Parallel Environment) queue:

pe_name mpich
slots 9999
user_lists NONE
xuser_lists NONE
start_proc_args /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args /opt/gridengine/mpi/stopmpi.sh
allocation_rule $fill_up
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE

Wouldn't the -bynode allocation be really inefficient? Does the -bynode switch imply only one slot is used on each node before it moves on to the next?

Thanks for your help Ralph. At least I have some ideas on where to look now.

-Bill
________________________________________
From: users-bounces_at_[hidden] [users-bounces_at_[hidden]] on behalf of Ralph Castain [rhc_at_[hidden]]
Sent: Tuesday, July 26, 2011 6:32 AM
To: Open MPI Users
Subject: Re: [OMPI users] Can run OpenMPI testcode on 86 or fewer slots in cluster, but nothing more than that

A few thoughts:

* including both btl_tcp_if_include and btl_tcp_if_exclude is problematic as they are mutually exclusive options. I'm not sure which one will take precedence. I would suggest only using one of them.

* the default mapping algorithm is byslot - i.e., OMPI will place procs on each node of the cluster until all slots on that node have been filled, and then moves to the next node. Depending on what you have in your machinefile, it is possible that all 88 procs are being placed on the first node. You might try spreading your procs across all nodes with -bynode on the cmd line, or check to ensure that the machinefile is correctly specifying the number of slots on each node. Note: OMPI will automatically read the SGE environment to get the host allocation, so the only reason for providing a machinefile is if you don't want the full allocation used.

* 88*88 = 7744. MPI transport connections are point-to-point - i.e., each proc opens a unique connection to another proc. If your procs are all winding up on the same node, for example, then the system will want at least 7744 file descriptors on that node, assuming your application does a complete wireup across all procs.
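
The difference between byslot and bynode mapping, and the descriptor arithmetic in the last point, can be sketched in Python. This is a simplified model and not Open MPI's mapper: the function names and the slot lists are hypothetical, and the per-node estimate simply repeats the arithmetic above (local procs times total procs, assuming a complete wireup).

```python
def _check(nprocs, slots):
    # Both mappers below assume the allocation is large enough.
    if nprocs > sum(slots):
        raise ValueError("more procs than slots in this sketch")


def map_byslot(nprocs, slots):
    """Fill every slot on a node before moving on to the next node."""
    _check(nprocs, slots)
    placed = [0] * len(slots)
    node = 0
    for _ in range(nprocs):
        while placed[node] >= slots[node]:
            node += 1
        placed[node] += 1
    return placed


def map_bynode(nprocs, slots):
    """Place one proc per node in turn, cycling and skipping full nodes."""
    _check(nprocs, slots)
    placed = [0] * len(slots)
    node = 0
    for _ in range(nprocs):
        while placed[node] >= slots[node]:
            node = (node + 1) % len(slots)
        placed[node] += 1
        node = (node + 1) % len(slots)
    return placed


def descriptors_per_node(placed):
    """Rough per-node estimate: each local proc connects to every proc."""
    total = sum(placed)
    return [local * total for local in placed]


if __name__ == "__main__":
    # Hypothetical machinefile that advertises 88 slots on a single node.
    print(descriptors_per_node(map_byslot(88, [88])))
    # Four 24-slot nodes, mapped both ways.
    print(map_byslot(88, [24, 24, 24, 24]))
    print(map_bynode(88, [24, 24, 24, 24]))
```

In this model, bynode still uses every slot on a node when the job is large enough; it only changes the order in which procs are handed out, so ranks are spread across nodes instead of packed onto the first ones.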

Updating to 1.4.3 would be a good idea as it is more stable, but it may not resolve this problem if the issue is one of the above.
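
Since the original post mentions raising the open-file limit from 1024 to 4096, it may be worth confirming what limit a process actually sees when it runs inside the job. The Python sketch below uses the standard `resource` module to read the limit and compare it against an estimate such as the 7744 figure above. The helper names are hypothetical, and whether processes launched through SGE inherit the raised limit is something to verify rather than assume.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def enough_descriptors(needed, soft):
    """True if the soft limit is unlimited or at least `needed`."""
    return soft == resource.RLIM_INFINITY or soft >= needed


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
    # 7744 is the estimate from the discussion above, not a measured value.
    print("enough for 7744 descriptors:", enough_descriptors(7744, soft))
```

Running this from within the qsub script, rather than from a login shell, shows the limit that applies to the job environment.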

HTH
Ralph

On Jul 25, 2011, at 11:23 PM, Lane, William wrote:

> Please help me resolve the following problems with a 306 node Rocks cluster using SGE. Please note I can run the
> job successfully on <87 slots, but not any more than that.
>
> We're running SGE and I'm submitting my jobs via the SGE CLI utility qsub and the following lines from a script:
>
> mpirun -n $NSLOTS -machinefile $TMPDIR/machines --mca btl_tcp_if_include eth0 --mca btl_tcp_if_exclude eth1 --mca oob_tcp_if_exclude eth1 --mca opal_set_max_sys_limits 1 --mca pls_gridengine_verbose 1 /stf/billstst/ProcessColors2MPICH1
> echo "MPICH1 mpirun returned $?"
>
> eth1 is the connection to the Isilon NAS, where the object file is located.
>
> The error messages returned are of the form:
>
> WRT ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached
> WRT ORTE_ERROR_LOG: The system limit on number of network connections a process can open was reached in file oob_tcp.c at line 447
>
> We have increased the open file limit to 4096 from 1024, but the problem still exists.
>
> I can run the same test code via MPICH2 successfully on all 696 slots of the cluster, but I can't run the
> same code (compiled via OpenMPI version 1.3.3) on any more than 86 slots.
>
> Here are the details on the installed version of Open MPI:
>
> [root}# ./ompi_info
> Package: Open MPI root_at_[hidden] Distribution
> Open MPI: 1.3.3
> Open MPI SVN revision: r21666
> Open MPI release date: Jul 14, 2009
> Open RTE: 1.3.3
> Open RTE SVN revision: r21666
> Open RTE release date: Jul 14, 2009
> OPAL: 1.3.3
> OPAL SVN revision: r21666
> OPAL release date: Jul 14, 2009
> Ident string: 1.3.3
> Prefix: /opt/openmpi
> Configured architecture: x86_64-unknown-linux-gnu
> Configure host: build-x86-64.rocksclusters.org
> Configured by: root
> Configured on: Sat Dec 12 16:29:23 PST 2009
> Configure host: build-x86-64.rocksclusters.org
> Built by: bruno
> Built on: Sat Dec 12 16:42:52 PST 2009
> Built host: build-x86-64.rocksclusters.org
> C bindings: yes
> C++ bindings: yes
> Fortran77 bindings: yes (all)
> Fortran90 bindings: yes
> Fortran90 bindings size: small
> C compiler: gcc
> C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
> C++ compiler absolute: /usr/bin/g++
> Fortran77 compiler: gfortran
> Fortran77 compiler abs: /usr/bin/gfortran
> Fortran90 compiler: gfortran
> Fortran90 compiler abs: /usr/bin/gfortran
> C profiling: yes
> C++ profiling: yes
> Fortran77 profiling: yes
> Fortran90 profiling: yes
> C++ exceptions: no
> Thread support: posix (mpi: no, progress: no)
> Sparse Groups: no
> Internal debug support: no
> MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
> libltdl support: yes
> Heterogeneous support: no
> mpirun default --prefix: no
> MPI I/O support: yes
> MPI_WTIME support: gettimeofday
> Symbol visibility support: yes
> FT Checkpoint support: no (checkpoint thread: no)
> MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
> MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
> MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
> MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
> MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
> MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
> MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
> MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
> MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
> MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
> MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
> MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
> MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
> MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
> MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
> MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
> MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
> MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
> MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
> MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
> MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
> MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
> MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
> MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
> MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
> MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)
>
> Would upgrading to the latest version of OpenMPI (1.4.3) resolve this issue?
>
> Thank you,
>
> -Bill Lane
> IMPORTANT WARNING: This message is intended for the use of the person or entity to which it is addressed and may contain information that is privileged and confidential, the disclosure of which is governed by applicable law. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering it to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this information is STRICTLY PROHIBITED. If you have received this message in error, please notify us immediately by calling (310) 423-6428 and destroy the related message. Thank You for your cooperation.
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
