
Subject: [OMPI users] Performance scaled messaging and random crashes
From: Sébastien Boisvert (Sebastien.Boisvert.3_at_[hidden])
Date: 2012-06-28 22:03:03


I am getting random crashes (segmentation faults) on a supercomputer using 3 nodes with 12 cores per node. The same program (Ray) runs without any problem on the other supercomputers I use.

The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA", and the messages transit using Performance Scaled Messaging (PSM), which I think is some sort of replacement for InfiniBand verbs, although I am not sure.
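
To confirm that the PSM MTL component is actually built into this Open MPI installation (an assumption on my part), the compiled-in components can be listed:

ompi_info | grep -i psm

If the component is present, I would expect a line mentioning "MCA mtl: psm".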

Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves the problem, but increases the latency from 20 microseconds to 55 microseconds, presumably because the messages then fall back to plain InfiniBand verbs instead of PSM.
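
To double-check which transport is actually selected at run time, I would try raising the MTL framework's verbosity (I am assuming the usual <framework>_base_verbose MCA parameter applies to the mtl framework in this version):

mpiexec --mca mtl_base_verbose 100 -n 36 ... (same Ray command as below)

The PSM MTL's own parameters can also be listed with:

ompi_info --param mtl psm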

There seems to be some sort of message corruption in transit, but I cannot rule out other explanations.

I have no idea what is going on, or why disabling PSM solves the problem.
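
To test the corruption hypothesis directly, something like the minimal test below could help (my own sketch, independent of Ray; the message size and round count are arbitrary). Rank 0 sends buffers filled with a deterministic byte pattern, and every other rank verifies each byte on arrival.

/* psm_check.c: minimal sketch, independent of Ray.
 * Rank 0 sends buffers filled with a deterministic pattern;
 * the other ranks verify every byte on arrival. */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE 4096   /* bytes per message (arbitrary choice) */
#define ROUNDS   1000   /* number of exchanges (arbitrary choice) */

int main(int argc, char **argv) {
    int rank, size, round, i, errors = 0;
    unsigned char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (round = 0; round < ROUNDS; round++) {
        if (rank == 0) {
            /* Deterministic pattern derived from the round number. */
            for (i = 0; i < MSG_SIZE; i++)
                buf[i] = (unsigned char)((round + i) & 0xff);
            for (i = 1; i < size; i++)
                MPI_Send(buf, MSG_SIZE, MPI_BYTE, i, round, MPI_COMM_WORLD);
        } else {
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, round, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* Any byte that does not match the expected pattern means
             * the payload was corrupted somewhere in transit. */
            for (i = 0; i < MSG_SIZE; i++) {
                if (buf[i] != (unsigned char)((round + i) & 0xff)) {
                    errors++;
                    break;
                }
            }
        }
    }

    if (rank != 0)
        printf("rank %d: %d corrupted message(s) out of %d\n",
               rank, errors, ROUNDS);

    MPI_Finalize();
    return 0;
}

Compiled with 'mpicc psm_check.c -o psm_check' and run as 'mpiexec -n 36 psm_check', once with and once without '--mca mtl ^psm'. If corrupted messages show up only in the PSM runs, that would point at the transport rather than at Ray.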


Loaded modules:

module load gcc/4.5.3
module load openmpi/1.4.3-gcc

Command that randomly crashes

mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \

Command that completes successfully

mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \

Sébastien Boisvert