Open MPI User's Mailing List Archives

Subject: [OMPI users] Segfault on any MPI communication on head node
From: Vassenkov, Phillip (Phillip.Vassenkov_at_[hidden])
Date: 2011-09-23 17:27:26

Hey all,
I've been racking my brains over this for several days and was hoping someone could enlighten me. I'll describe only the relevant parts of the network and computer systems. There is one head node and a number of regular nodes, and the regular nodes are all identical to each other. If I run an MPI program from one regular node to any other regular nodes, everything works. If I include the head node in the hosts file, I get segfaults, which I'll paste below along with sample code.

The machines are all networked via both InfiniBand and Ethernet. The issue only arises when MPI communication occurs. By this I mean that MPI_Init might succeed, but the segfault always occurs on MPI_Barrier or MPI_Send/MPI_Recv.

I found a workaround: disable the openib BTL and force the TCP traffic onto the InfiniBand interface, ib0 (if I don't force ib0, it goes over Ethernet). This command works when the head node is included in the hosts file:
mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out

Sample Code:
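#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, nprocs;
    char name[MPI_MAX_PROCESSOR_NAME];   /* was char* name[20] */
    int maxlen = MPI_MAX_PROCESSOR_NAME;
    /* The MPI calls are missing from the archived copy; inferred from the variables above. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Get_processor_name(name, &maxlen);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Hello, world. I am %d of %d and host %s\n", rank, nprocs, name);
    MPI_Finalize();
    return 0;
}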

[pastec:19917] *** Process received signal ***
[pastec:19917] Signal: Segmentation fault (11)
[pastec:19917] Signal code: Address not mapped (1)
[pastec:19917] Failing at address: 0x8
[pastec:19917] [ 0] /lib64/ [0x34a880eeb0]
[pastec:19917] [ 1] /usr/lib64/ [0x7eff6430b6aa]
[pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/ [0x7eff66a163c9]
[pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/ [0x7eff66a21b70]
[pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/ [0x7eff66a21c89]
[pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/ [0x7eff66a1703d]
[pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/ [0x7eff676670e6]
[pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/ [0x7eff6765b273]
[pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/ [0x7eff65539b2f]
[pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/ [0x7eff655425cf]
[pastec:19917] [10] /usr/lib64/openmpi/lib/ [0x3a54c4c94e]
[pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
[pastec:19917] [12] /lib64/ [0x34a841ee5d]
[pastec:19917] [13] ./b.out() [0x400919]
[pastec:19917] *** End of error message ***
[] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
mpirun noticed that process rank 1 with PID 19917 on node exited on signal 11 (Segmentation fault).