Open MPI User's Mailing List Archives

Subject: [OMPI users] Segmentation fault in mca_btl_tcp
From: Werner Van Geit (werner.vangeit.spam_at_[hidden])
Date: 2010-04-15 02:57:10


Hi,

We are using Open MPI 1.4.1 on our cluster (in conjunction with Torque). One of our users has a problem: his jobs generate a segmentation fault on one of the slaves. This is the backtrace:

[cstone-00613:28461] *** Process received signal ***
[cstone-00613:28461] Signal: Segmentation fault (11)
[cstone-00613:28461] Signal code: (128)
[cstone-00613:28461] Failing at address: (nil)
[cstone-00613:28462] *** Process received signal ***
[cstone-00613:28462] Signal: Segmentation fault (11)
[cstone-00613:28462] Signal code: Address not mapped (1)
[cstone-00613:28462] Failing at address: (nil)
[cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
[cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530ec7a]
[cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530d860]
[cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
[cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
[cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
[cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) [0x2ba19364c63b]
[cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) [0x2ba192e98b8a]
[cstone-00613:28461] [ 8] ./roms [0x44976c]
[cstone-00613:28461] [ 9] ./roms [0x449d96]
[cstone-00613:28461] [10] ./roms [0x422708]
[cstone-00613:28461] [11] ./roms [0x402908]
[cstone-00613:28461] [12] ./roms [0x402467]
[cstone-00613:28461] [13] ./roms [0x46d20e]
[cstone-00613:28461] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba1933ca164]
[cstone-00613:28461] [15] ./roms [0x401dd9]
[cstone-00613:28461] *** End of error message ***
[cstone-00613:28462] [ 0] /lib64/libc.so.6 [0x2b5d57db6e20]
[cstone-00613:28462] *** End of error message ***
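
One way to read a backtrace like this is to group the frames by the install prefix of each shared object. The following is a minimal sketch, not part of the original report: the helper name `libs_by_prefix` and the choice of `/lib/` as the split marker are assumptions made for illustration, and the input is frames 1-7 copied from the trace above.

```python
import re
from collections import defaultdict

# Frames 1-7 of process 28461, copied verbatim from the backtrace above.
BACKTRACE = """\
[cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530ec7a]
[cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530d860]
[cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
[cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
[cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
[cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) [0x2ba19364c63b]
[cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) [0x2ba192e98b8a]
"""

# Matches "[host:pid] [ N] /path/to/object" and captures the object path,
# stopping before any "(symbol+offset)" or "[address]" suffix.
FRAME_RE = re.compile(r"^\[[^\]]+\]\s+\[\s*\d+\]\s+(/[^\s(\[]+)")


def libs_by_prefix(text, marker="/lib/"):
    """Group shared objects in a backtrace by the directory preceding `marker`.

    Hypothetical helper: frames whose path does not contain `marker`
    (for example ./roms or /lib64/libc.so.6) are skipped.
    """
    groups = defaultdict(set)
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue
        prefix, sep, rest = match.group(1).partition(marker)
        if sep:
            groups[prefix].add(rest)
    return {prefix: sorted(objs) for prefix, objs in groups.items()}


if __name__ == "__main__":
    for prefix, objs in libs_by_prefix(BACKTRACE).items():
        print(prefix, objs)
```

For the frames above this reports two prefixes, `/opt/openmpi-1.3` (the `mca_btl_tcp.so` component) and `/opt/openmpi` (the core libraries). The log alone does not show whether these are separate installations or links to the same one, so the grouping is only a starting point for checking which build each process actually loads.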

The other slaves crash with:
[cstone-00612][[21785,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Since this problem seems to occur in the network part of MPI, my guess is that there is either something wrong with the network or a bug in Open MPI.
The same problem also appeared when we were using Open MPI 1.3.

How could this problem be solved?

(For more information about the system, see the attachments.)

Thx,

Werner Van Geit


                 Package: Open MPI gpike_at_cstone-login2 Distribution
                Open MPI: 1.4.1
   Open MPI SVN revision: r22421
   Open MPI release date: Jan 14, 2010
                Open RTE: 1.4.1
   Open RTE SVN revision: r22421
   Open RTE release date: Jan 14, 2010
                    OPAL: 1.4.1
       OPAL SVN revision: r22421
       OPAL release date: Jan 14, 2010
            Ident string: 1.4.1
                  Prefix: /opt/openmpi-1.4.1
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: cstone-login2
           Configured by: gpike
           Configured on: Wed Feb 3 14:33:11 JST 2010
          Configure host: cstone-login2
                Built by: gpike
                Built on: Wed Feb 3 14:43:40 JST 2010
              Built host: cstone-login2
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
Symbol visibility support: yes
   FT Checkpoint support: no (checkpoint thread: no)
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.4.1)
              MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4.1)
           MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.4.1)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.4.1)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.4.1)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.4.1)
           MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.4.1)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.4.1)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.4.1)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.4.1)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.4.1)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.4.1)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.4.1)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.4.1)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.4.1)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.4.1)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.4.1)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.4.1)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.4.1)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.4.1)
                  MCA io: romio (MCA v2.0, API v2.0, Component v1.4.1)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.4.1)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.4.1)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA pml: cm (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.4.1)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.4.1)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.4.1)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.1)
               MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.4.1)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.4.1)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.4.1)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.4.1)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.4.1)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.4.1)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.1)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.4.1)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.4.1)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.4.1)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.4.1)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.4.1)

eth0 Link encap:Ethernet HWaddr 00:22:19:90:B1:81
          inet addr:10.31.0.197 Bcast:10.31.255.255 Mask:255.255.0.0
          inet6 addr: fe80::222:19ff:fe90:b181/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:249910284 errors:0 dropped:0 overruns:0 frame:0
          TX packets:250591692 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:106681454651 (101739.3 Mb) TX bytes:106434608548 (101503.9 Mb)
          Interrupt:169 Memory:dc000000-dc012100

eth1 Link encap:Ethernet HWaddr 00:22:19:90:B1:83
          inet addr:10.30.0.197 Bcast:10.30.255.255 Mask:255.255.0.0
          inet6 addr: fe80::222:19ff:fe90:b183/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
          RX packets:297930813 errors:0 dropped:0 overruns:0 frame:0
          TX packets:356489683 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:182393942711 (173944.4 Mb) TX bytes:908060768518 (865994.2 Mb)
          Interrupt:185 Memory:da000000-da012100

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:25149 errors:0 dropped:0 overruns:0 frame:0
          TX packets:25149 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:25401111 (24.2 Mb) TX bytes:25401111 (24.2 Mb)