I am getting random crashes (segmentation faults) on a supercomputer
using 3 nodes with 12 cores per node. The same program (Ray) runs
without problems on the other supercomputers I use.
The interconnect is InfiniBand (QLogic Corp. InfiniPath QME7342 QDR).
The messages transit using "Performance Scaled Messaging" (PSM), which
I think is some sort of replacement for InfiniBand verbs, although I am
not sure.
Adding '--mca mtl ^psm' to the Open-MPI mpiexec options solves the
problem, but increases the latency from 20 microseconds to 55
microseconds.
There seems to be some sort of message corruption during transit, but I
can not rule out other causes.
I have no idea what is going on or why disabling PSM solves the problem.
module load gcc/4.5.3
module load openmpi/1.4.3-gcc
Command that randomly crashes:
mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
Command that completes successfully:
mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
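As a side note, the same workaround can be made persistent so it does not have to be passed on every mpiexec invocation. This is a sketch, assuming the standard per-user Open MPI parameter file location $HOME/.openmpi/mca-params.conf:

```
# Exclude the PSM MTL; the leading ^ negates the component list,
# so Open MPI selects a different transport (e.g. InfiniBand verbs
# via the openib BTL) instead of PSM.
mtl = ^psm
```

With this file in place, a plain `mpiexec -n 36 ... Ray ...` should behave as if `--mca mtl ^psm` had been given on the command line; command-line MCA options still override values set in the file.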