Hi,
I'm observing a random segmentation fault during an internode parallel
computation involving the openib btl and OpenMPI-1.4.2 (the same issue
can be observed with OpenMPI-1.3.3). The output at the time of the
crash is:
mpirun (Open MPI) 1.4.2
Report bugs to http://www.open-mpi.org/community/help/
[pbn08:02624] *** Process received signal ***
[pbn08:02624] Signal: Segmentation fault (11)
[pbn08:02624] Signal code: Address not mapped (1)
[pbn08:02624] Failing at address: (nil)
[pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
[pbn08:02624] *** End of error message ***
sh: line 1: 2624 Segmentation fault
\/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_64\/bin\/actranpy_mp
'--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/Actran_11.0.rc2.41872'
'--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
'--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
'--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'
If I choose not to use the openib btl (by using --mca btl self,sm,tcp on
the command line, for instance), I don't encounter any problem and the
parallel computation runs flawlessly.
I would like some help to:
- diagnose the issue I'm facing with the openib btl
- understand why this issue is observed only when using the openib btl
and not when using self,sm,tcp
Any help would be very much appreciated.
The outputs of ompi_info and the configure scripts of OpenMPI are
attached to this email, along with some information on the infiniband
drivers.
Here is the command line used when launching a parallel computation
using infiniband:
path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
btl openib,sm,self,tcp --display-map --verbose --version --mca
mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
and the command line used if not using infiniband:
path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca
btl self,sm,tcp --display-map --verbose --version --mca
mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]
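To make the comparison explicit: the two invocations differ only in the
btl list. The sketch below prints both command lines from one template.
The helper name build_mpirun_cmd and the fallback of 2 processes are
hypothetical and used only for illustration. They are not part of the
actual launch script, and the application arguments elided as [...] above
are left out.

```shell
# Hypothetical helper: print the mpirun command line for a given btl list.
# MPIRUN and NPROCESS fall back to placeholders when they are not set.
build_mpirun_cmd() {
    btl_list="$1"
    printf '%s -np %s --hostfile host.list --mca btl %s --display-map --verbose --version --mca mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0\n' \
        "${MPIRUN:-path_to_openmpi/bin/mpirun}" "${NPROCESS:-2}" "$btl_list"
}

build_mpirun_cmd openib,sm,self,tcp   # configuration that segfaults
build_mpirun_cmd self,sm,tcp          # configuration that runs fine
```

Everything other than the value passed to --mca btl is identical between
the two runs.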
Thanks,
Eloi