Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault in mca_btl_tcp
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2010-04-15 06:41:17


Can you send a small program that reproduces the problem, perchance?

-jms
Sent from my PDA. No type good.

----- Original Message -----
From: users-bounces_at_[hidden] <users-bounces_at_[hidden]>
To: users_at_[hidden] <users_at_[hidden]>
Sent: Thu Apr 15 01:57:10 2010
Subject: [OMPI users] Segmentation fault in mca_btl_tcp

Hi,

We are using Open MPI 1.4.1 on our cluster (in conjunction with Torque). One of our users has a problem with his jobs generating a segmentation fault on one of the slave nodes; this is the backtrace:

[cstone-00613:28461] *** Process received signal ***
[cstone-00613:28461] Signal: Segmentation fault (11)
[cstone-00613:28461] Signal code: (128)
[cstone-00613:28461] Failing at address: (nil)
[cstone-00613:28462] *** Process received signal ***
[cstone-00613:28462] Signal: Segmentation fault (11)
[cstone-00613:28462] Signal code: Address not mapped (1)
[cstone-00613:28462] Failing at address: (nil)
[cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
[cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530ec7a]
[cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so [0x2ba19530d860]
[cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
[cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2ba1938e072e]
[cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
[cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) [0x2ba19364c63b]
[cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) [0x2ba192e98b8a]
[cstone-00613:28461] [ 8] ./roms [0x44976c]
[cstone-00613:28461] [ 9] ./roms [0x449d96]
[cstone-00613:28461] [10] ./roms [0x422708]
[cstone-00613:28461] [11] ./roms [0x402908]
[cstone-00613:28461] [12] ./roms [0x402467]
[cstone-00613:28461] [13] ./roms [0x46d20e]
[cstone-00613:28461] [14] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba1933ca164]
[cstone-00613:28461] [15] ./roms [0x401dd9]
[cstone-00613:28461] *** End of error message ***
[cstone-00613:28462] [ 0] /lib64/libc.so.6 [0x2b5d57db6e20]
[cstone-00613:28462] *** End of error message ***

The other slaves crash with:
[cstone-00612][[21785,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Since this problem seems to occur in the network part of MPI, my guess is that either something is wrong with the network or there is a bug in Open MPI.
The same problem also appeared back when we were using Open MPI 1.3.

How could this problem be solved?

(For more info about the system, see the attachments.)

Thx,

Werner Van Geit