Hi,
We are using Open MPI 1.4.1 on our cluster (in conjunction with Torque). One of our users has jobs that fail with a segmentation fault on one of the slave nodes. This is the backtrace:
[cstone-00613:28461] *** Process received signal ***
[cstone-00613:28461] Signal: Segmentation fault (11)
[cstone-00613:28461] Signal code: (128)
[cstone-00613:28461] Failing at address: (nil)
[cstone-00613:28462] *** Process received signal ***
[cstone-00613:28462] Signal: Segmentation fault (11)
[cstone-00613:28462] Signal code: Address not mapped (1)
[cstone-00613:28462] Failing at address: (nil)
[cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
[cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530ec7a]
[cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530d860]
[cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
[cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
[cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
[cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) [0x2ba19364c63b]
[cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) [0x2ba192e98b8a]
[cstone-00613:28461] [ 8] ./roms [0x44976c]
[cstone-00613:28461] [ 9] ./roms [0x449d96]
[cstone-00613:28461] [10] ./roms [0x422708]
[cstone-00613:28461] [11] ./roms [0x402908]
[cstone-00613:28461] [12] ./roms [0x402467]
[cstone-00613:28461] [13] ./roms [0x46d20e]
[cstone-00613:28461] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba1933ca164]
[cstone-00613:28461] [15] ./roms [0x401dd9]
[cstone-00613:28461] *** End of error message ***
[cstone-00613:28462] [ 0] /lib64/libc.so.6 [0x2b5d57db6e20]
[cstone-00613:28462] *** End of error message ***
The other slaves crash with:
[cstone-00612][[21785,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
Since the problem seems to occur in the network part of MPI, my guess is that either something is wrong with the network or there is a bug in Open MPI.
The same problem also appeared when we were still using Open MPI 1.3.
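To make it easier to see which libraries the crash passes through, here is a minimal Python sketch that groups the backtrace frames by the binary or shared object they belong to. The names `frames_by_object` and `install_prefixes` are made up for illustration, the embedded trace is an abridged copy of the frames above, and the regular expression assumes the frame layout shown in this message. Note that the frames refer to both `/opt/openmpi-1.3` and `/opt/openmpi`; the trace alone cannot tell whether those are the same installation (for example via a symlink).

```python
import re

# Abridged copy of the frame lines from the backtrace in this message.
TRACE = """\
[cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
[cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530ec7a]
[cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530d860]
[cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
[cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
[cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) [0x2ba19364c63b]
[cstone-00613:28461] [ 8] ./roms [0x44976c]
"""

# Assumed frame layout: [host:pid] [idx] object(symbol+0xoff) [0xaddr]
# The "(symbol+0xoff)" part is optional.
FRAME_RE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+\[\s*(?P<idx>\d+)\]\s+"
    r"(?P<obj>[^\s(]+)(?:\((?P<sym>[^)+]+)\+0x[0-9a-f]+\))?\s+\[0x[0-9a-f]+\]$"
)


def frames_by_object(trace):
    """Map each binary/shared object to the frame indices that refer to it."""
    result = {}
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # non-frame lines such as "*** Process received signal ***" are skipped
            result.setdefault(match.group("obj"), []).append(int(match.group("idx")))
    return result


def install_prefixes(objects):
    """Return the distinct /opt/... prefixes (text before '/lib/') seen in the trace."""
    return sorted({obj.split("/lib/")[0] for obj in objects if obj.startswith("/opt/")})


if __name__ == "__main__":
    objects = frames_by_object(TRACE)
    for obj, indices in objects.items():
        print(obj, indices)
    print("install prefixes:", install_prefixes(objects))
```

Running it against the full trace only requires pasting the complete output into `TRACE`.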
How could this problem be solved?
(For more information about the system, see the attachments.)
Thx,
Werner Van Geit
Package: Open MPI gpike_at_cstone-login2 Distribution
Open MPI: 1.4.1
Open MPI SVN revision: r22421
Open MPI release date: Jan 14, 2010
Open RTE: 1.4.1
Open RTE SVN revision: r22421
Open RTE release date: Jan 14, 2010
OPAL: 1.4.1
OPAL SVN revision: r22421
OPAL release date: Jan 14, 2010
Ident string: 1.4.1
Prefix: /opt/openmpi-1.4.1
Configured architecture: x86_64-unknown-linux-gnu
Configure host: cstone-login2
Configured by: gpike
Configured on: Wed Feb 3 14:33:11 JST 2010
Configure host: cstone-login2
Built by: gpike
Built on: Wed Feb 3 14:43:40 JST 2010
Built host: cstone-login2
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: gfortran
Fortran77 compiler abs: /usr/bin/gfortran
Fortran90 compiler: gfortran
Fortran90 compiler abs: /usr/bin/gfortran
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Sparse Groups: no
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol visibility support: yes
FT Checkpoint support: no (checkpoint thread: no)
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.1)
MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.1)
MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.1)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4.1)
MCA carto: file (MCA v2.0, API v2.0, Component v1.4.1)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.1)
MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.1)
MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.1)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.1)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.1)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4.1)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4.1)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4.1)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4.1)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.4.1)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4.1)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.4.1)
MCA coll: self (MCA v2.0, API v2.0, Component v1.4.1)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.4.1)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.4.1)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.4.1)
MCA io: romio (MCA v2.0, API v2.0, Component v1.4.1)
MCA mpool: fake (MCA v2.0, API v2.0, Component v1.4.1)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.4.1)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.4.1)
MCA pml: cm (MCA v2.0, API v2.0, Component v1.4.1)
MCA pml: csum (MCA v2.0, API v2.0, Component v1.4.1)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.4.1)
MCA pml: v (MCA v2.0, API v2.0, Component v1.4.1)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.4.1)
MCA rcache: vma (MCA v2.0, API v2.0, Component v1.4.1)
MCA btl: self (MCA v2.0, API v2.0, Component v1.4.1)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.4.1)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.4.1)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.4.1)
MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.4.1)
MCA osc: rdma (MCA v2.0, API v2.0, Component v1.4.1)
MCA iof: hnp (MCA v2.0, API v2.0, Component v1.4.1)
MCA iof: orted (MCA v2.0, API v2.0, Component v1.4.1)
MCA iof: tool (MCA v2.0, API v2.0, Component v1.4.1)
MCA oob: tcp (MCA v2.0, API v2.0, Component v1.4.1)
MCA odls: default (MCA v2.0, API v2.0, Component v1.4.1)
MCA ras: slurm (MCA v2.0, API v2.0, Component v1.4.1)
MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.1)
MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.4.1)
MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.4.1)
MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.4.1)
MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.4.1)
MCA rml: oob (MCA v2.0, API v2.0, Component v1.4.1)
MCA routed: binomial (MCA v2.0, API v2.0, Component v1.4.1)
MCA routed: direct (MCA v2.0, API v2.0, Component v1.4.1)
MCA routed: linear (MCA v2.0, API v2.0, Component v1.4.1)
MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.1)
MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.1)
MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.1)
MCA filem: rsh (MCA v2.0, API v2.0, Component v1.4.1)
MCA errmgr: default (MCA v2.0, API v2.0, Component v1.4.1)
MCA ess: env (MCA v2.0, API v2.0, Component v1.4.1)
MCA ess: hnp (MCA v2.0, API v2.0, Component v1.4.1)
MCA ess: singleton (MCA v2.0, API v2.0, Component v1.4.1)
MCA ess: slurm (MCA v2.0, API v2.0, Component v1.4.1)
MCA ess: tool (MCA v2.0, API v2.0, Component v1.4.1)
MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.4.1)
MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.4.1)
eth0 Link encap:Ethernet HWaddr 00:22:19:90:B1:81
inet addr:10.31.0.197 Bcast:10.31.255.255 Mask:255.255.0.0
inet6 addr: fe80::222:19ff:fe90:b181/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:249910284 errors:0 dropped:0 overruns:0 frame:0
TX packets:250591692 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:106681454651 (101739.3 Mb) TX bytes:106434608548 (101503.9 Mb)
Interrupt:169 Memory:dc000000-dc012100
eth1 Link encap:Ethernet HWaddr 00:22:19:90:B1:83
inet addr:10.30.0.197 Bcast:10.30.255.255 Mask:255.255.0.0
inet6 addr: fe80::222:19ff:fe90:b183/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:297930813 errors:0 dropped:0 overruns:0 frame:0
TX packets:356489683 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:182393942711 (173944.4 Mb) TX bytes:908060768518 (865994.2 Mb)
Interrupt:185 Memory:da000000-da012100
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:25149 errors:0 dropped:0 overruns:0 frame:0
TX packets:25149 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:25401111 (24.2 Mb) TX bytes:25401111 (24.2 Mb)