Open MPI User's Mailing List Archives

Subject: [OMPI users] Performance scaled messaging and random crashes
From: Sébastien Boisvert (Sebastien.Boisvert.3_at_[hidden])
Date: 2012-06-28 22:03:03


Hello,

I am getting random crashes (segmentation faults) on a supercomputer
(guillimin) using 3 nodes with 12 cores per node. The same program (Ray)
runs without any problem on the other supercomputers I use.

The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA"
and the messages transit using "performance scaled messaging" (PSM), which
I think is some sort of replacement for InfiniBand verbs, although I am
not sure.

Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves the
problem, but increases the latency from 20 microseconds to 55
microseconds.

There seems to be some sort of message corruption during transit, but I
cannot rule out other explanations.

I have no idea what is going on or why disabling PSM solves the problem.

Versions

module load gcc/4.5.3
module load openmpi/1.4.3-gcc

Command that randomly crashes

mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq

Command that completes successfully

mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
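
Because the crashes are random, a single successful run does not prove
much. A minimal sketch for counting how often a command fails over
repeated runs is given below. The function name count_failures is
hypothetical (it is not part of Open MPI or Ray), and the sketch assumes
that mpiexec returns a non-zero exit status when a rank dies with a
segmentation fault.

```shell
#!/bin/sh
# count_failures N CMD [ARGS...]
# Runs CMD N times and prints how many of those runs exited with a
# non-zero status. Hypothetical helper, not part of Open MPI or Ray.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

Calling it as "count_failures 20" followed by each of the two mpiexec
command lines above would give a failure count with PSM and another
without it, which makes the comparison less anecdotal.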

Sébastien Boisvert