Open MPI User's Mailing List Archives

Subject: [OMPI users] intermittent segfaults with openib on ring_c.c
From: Fischer, Greg A. (fischega_at_[hidden])
Date: 2014-06-03 12:38:19

Hello openmpi-users,

I'm seeing intermittent segmentation faults on a new system when I run the ring_c.c example with the openib BTL; see the example below. Roughly half of the runs produce the expected output, and the other half segfault. LD_LIBRARY_PATH is set correctly, and the correct "mpirun" is being invoked. The output of ompi_info -all is attached.

One possible cause is that the system where Open MPI was compiled is mostly the same as the system where it is being run, but some of the installed packages differ. I've checked the critical ones (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same.

Can anyone suggest how I might start tracking this problem down?


[binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
[binf102:31268] *** Process received signal ***
[binf102:31268] Signal: Segmentation fault (11)
[binf102:31268] Signal code: Address not mapped (1)
[binf102:31268] Failing at address: 0x10
[binf102:31268] [ 0] /lib64/ [0x2b42213f57c0]
[binf102:31268] [ 1] /xxxx/yyyy_ib/intel- [0x2b42203fd7e3]
[binf102:31268] [ 2] /xxxx/yyyy_ib/intel- [0x2b4220400d3b]
[binf102:31268] [ 3] /xxxx/yyyy_ib/intel- [0x2b42204008ef]
[binf102:31268] [ 4] /xxxx/yyyy_ib/intel- [0x2b4220400876]
[binf102:31268] [ 5] /xxxx/yyyy_ib/intel- [0x2b422572334c]
[binf102:31268] [ 6] /xxxx/yyyy_ib/intel- [0x2b422041d64a]
[binf102:31268] [ 7] /xxxx/yyyy_ib/intel- [0x2b422573612f]
[binf102:31268] [ 8] /lib64/ [0x2b42213ed7b6]
[binf102:31268] [ 9] /lib64/ [0x2b42216dcd6d]
[binf102:31268] *** End of error message ***
mpirun noticed that process rank 0 with PID 31268 on node xxxx102 exited on signal 11 (Segmentation fault).
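
The "roughly half of the runs" figure above is an estimate from running the command by hand. A minimal sketch of how the failure rate could be measured more carefully follows. The helper name count_failures is hypothetical, and the sketch assumes that mpirun returns a non-zero exit status whenever a rank dies on a signal.

```python
import subprocess


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        # Assumption: a non-zero exit status marks a bad run (for example,
        # mpirun reporting that a rank exited on signal 11).
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

Calling count_failures(["mpirun", "-np", "2", "--mca", "btl", "openib,self", "ring_c"], 20) would repeat the command from the trace 20 times. The run count of 20 is arbitrary. Repeating the measurement after changing one variable at a time (for example, a different BTL list) may help narrow down where the fault comes from.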