Tim Mattox wrote:
> For your runs with Open MPI over InfiniBand, try using openib,sm,self
> for the BTL setting, so that shared memory communications are used
> within a node. It would give us another datapoint to help diagnose
> the problem. As for other things we would need to help diagnose the
> problem, please follow the advice on this FAQ entry, and the help page:
> http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
> http://www.open-mpi.org/community/help/
>
Dear Tim,
thank you for this pointer.
1) OFED: version 1.2.5, from the OpenFabrics website
2) Linux version: Scientific Linux (a Red Hat Enterprise Linux rebuild) v. 4.2,
kernel 2.6.9-55.0.12.ELsmp
3) Subnet manager: OpenSM
4) ibv_devinfo:
hca_id: mthca0
fw_ver: 1.0.800
node_guid: 0002:c902:0022:b398
sys_image_guid: 0002:c902:0022:b39b
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_03B0120002
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 9
port_lid: 97
port_lmc: 0x00
(all nodes behave the same way as far as the problem is concerned)
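To compare this output across nodes mechanically rather than by eye, a minimal Python sketch follows. It assumes the flat "key: value" layout shown above and a single-port HCA (with several ports the per-port keys would overwrite each other). The set PER_NODE_FIELDS and the second node's values are illustrative assumptions, not measured data.

```python
# Trimmed copy of the ibv_devinfo output quoted above.
SAMPLE = """\
hca_id: mthca0
fw_ver: 1.0.800
node_guid: 0002:c902:0022:b398
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 9
port_lid: 97
"""

# Assumption: these fields legitimately differ between healthy nodes.
PER_NODE_FIELDS = {"node_guid", "sys_image_guid", "port_lid"}


def parse_ibv_devinfo(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def differing_fields(a, b):
    """Return the sorted keys whose values differ, skipping per-node fields."""
    keys = (set(a) | set(b)) - PER_NODE_FIELDS
    return sorted(k for k in keys if a.get(k) != b.get(k))


info = parse_ibv_devinfo(SAMPLE)
# Hypothetical second node whose port has not reached the active state.
other = dict(info, port_lid="98", state="PORT_INIT (2)")

print(info["state"].startswith("PORT_ACTIVE"))  # True
print(differing_fields(info, other))            # ['state']
```

Splitting on the first colon only keeps GUID values such as 0002:c902:0022:b398 intact.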
5) ifconfig:
eth0 Link encap:Ethernet HWaddr 00:17:31:E3:89:4A
inet addr:10.0.0.12 Bcast:10.0.0.255 Mask:255.255.255.0
inet6 addr: fe80::217:31ff:fee3:894a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:23348585 errors:0 dropped:0 overruns:0 frame:0
TX packets:17247486 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:19410724189 (18.0 GiB) TX bytes:14981325997 (13.9 GiB)
Interrupt:209
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:5088 errors:0 dropped:0 overruns:0 frame:0
TX packets:5088 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:2468843 (2.3 MiB) TX bytes:2468843 (2.3 MiB)
6) ulimit -l
8388608
(this is more than the physical memory on the node)
7) Output of ompi_info attached (I have also tried earlier releases)
8) Description of the problem: a program seems to communicate correctly
over the TCP network, but not over the InfiniBand network. The program is
structured in such a way that if the communication does not happen, a
loop becomes infinite. So there is no error message, just a program
stuck in an infinite loop.
The command line I use is:
mpirun -mca btl openib,sm,self <executable>
(with openib replaced by tcp in the case of communication over ethernet).
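Since the failure shows up only as a run that never ends, a timeout wrapper can turn the hang into an explicit result when comparing the two BTL settings. The following is a minimal Python sketch, not part of the original runs: the helper name run_with_timeout and the timeout values are assumptions, and a short Python child process stands in for the real mpirun command so the example is self-contained.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it did not finish in time.

    subprocess.run() kills the child when the timeout expires and then
    raises TimeoutExpired, which is mapped to None here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Stand-ins for the real job. In practice cmd would be something like
# ["mpirun", "-mca", "btl", "openib,sm,self", "./a.out"], where "./a.out"
# is a hypothetical placeholder for <executable>.
finishes = [sys.executable, "-c", "pass"]
hangs = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_with_timeout(finishes, 30))  # 0
print(run_with_timeout(hangs, 1))      # None
```

Note that only the direct child is killed on timeout, so with a real mpirun the MPI processes on the nodes may need to be cleaned up separately.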
I could include the path and the value of the variable LD_LIBRARY_PATH,
but they would not tell you much, since the installation directory is
non-standard (/opt/ompi128-intel/bin for the path and
/opt/ompi128-intel/lib for the libs).
I hope I have provided all the required information; if you need more, or
some of it in more detail, please let me know.
Many thanks,
Biagio Lucini
Open MPI: 1.2.8
Open MPI SVN revision: r19718
Open RTE: 1.2.8
Open RTE SVN revision: r19718
OPAL: 1.2.8
OPAL SVN revision: r19718
Prefix: /opt/ompi128-intel
Configured architecture: x86_64-unknown-linux-gnu
Configured by: root
Configured on: Tue Dec 23 12:33:51 GMT 2008
Configure host: master.cluster
Built by: root
Built on: Tue Dec 23 12:38:34 GMT 2008
Built host: master.cluster
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: icc
C compiler absolute: /opt/intel/cce/9.1.045/bin/icc
C++ compiler: icpc
C++ compiler absolute: /opt/intel/cce/9.1.045/bin/icpc
Fortran77 compiler: ifort
Fortran77 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
Fortran90 compiler: ifort
Fortran90 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.8)
MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.8)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.8)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.8)
MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.8)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.8)
MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.8)
MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.8)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.8)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.8)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.8)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.8)
MCA io: romio (MCA v1.0, API v1.0, Component v1.2.8)
MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.8)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.8)
MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.8)
MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.8)
MCA btl: openib (MCA v1.0, API v1.0.1, Component v1.2.8)
MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.8)
MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.8)
MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.8)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.8)
MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.8)
MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.8)
MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.8)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.8)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.8)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.8)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.8)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.8)
MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.8)
MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.8)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.8)
MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.8)
MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.8)
MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.8)
MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.8)
MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.8)
MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.8)
MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.8)
MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.8)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.8)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.8)
MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.8)
MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.8)
MCA sds: env (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.8)