
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Performance scaled messaging and random crashes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-06-29 07:30:14


Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
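To check whether an Open MPI build includes the PSM MTL at all, you can look for an "MCA mtl: psm" line in the output of ompi_info. The following is a minimal shell sketch. It assumes that the ompi_info from the loaded module is on PATH and that its component lines have the form "MCA mtl: psm (...)". The function name has_psm_mtl is hypothetical.

```shell
# has_psm_mtl (hypothetical helper): succeeds (exit status 0) if the ompi_info
# output read on stdin lists a PSM MTL component, i.e. a line like
# "MCA mtl: psm (...)"; fails otherwise.
# Usage: ompi_info | has_psm_mtl && echo "PSM MTL is available"
has_psm_mtl() {
  grep -Eq 'MCA mtl:[[:space:]]+psm([[:space:]]|$)'
}
```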

What version of Open MPI are you using?
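The version is reported by ompi_info on a line of the form "Open MPI: <version>". Below is a small sketch that extracts it. It assumes that line format, and the function name ompi_version is hypothetical.

```shell
# ompi_version (hypothetical helper): prints the version from the
# "Open MPI: <version>" line of ompi_info output read on stdin.
# Lines such as "Open MPI SVN revision: ..." do not match, because the
# pattern requires the colon to follow "Open MPI" directly.
# Usage: ompi_info | ompi_version
ompi_version() {
  awk -F': *' '/^[ \t]*Open MPI:/ { print $2; exit }'
}
```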

On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:

> Hello,
>
> I am getting random crashes (segmentation faults) on a supercomputer (guillimin)
> using 3 nodes with 12 cores per node. The same program (Ray) runs without any
> problem on the other supercomputers I use.
>
> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and
> the messages transit using "performance scaled messaging" (PSM), which I think is some
> sort of replacement for InfiniBand verbs, although I am not sure.
>
> Adding '--mca mtl ^psm' to the Open MPI mpiexec program options solves
> the problem, but increases the latency from 20 microseconds to 55 microseconds.
>
> There seems to be some sort of message corruption in transit, but I cannot rule out
> other explanations.
>
>
> I have no idea what is going on and why disabling PSM solves the problem.
>
>
> Versions
>
> module load gcc/4.5.3
> module load openmpi/1.4.3-gcc
>
>
> Command that randomly crashes
>
> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
> Ray -k 31 \
> -o MiSeq-bug-2012-06-28.1 \
> -p \
> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>
>
> Command that completes successfully
>
> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
> --mca mtl ^psm \
> Ray -k 31 \
> -o psm-bug-2012-06-26-hotfix.1 \
> -p \
> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>
>
>
> Sébastien Boisvert
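The '--mca mtl ^psm' workaround can also be set once through the environment instead of on every mpiexec command line. Open MPI reads MCA parameters from environment variables named OMPI_MCA_<parameter name>. This is a minimal sketch that assumes a POSIX-compatible shell.

```shell
# Exclude the PSM MTL for every mpiexec launched from this shell; this is
# equivalent to adding "--mca mtl ^psm" to each mpiexec command line.
# The caret is quoted because some shells give "^" special meaning
# (e.g. zsh with EXTENDED_GLOB, where ^psm is a negated glob pattern).
export OMPI_MCA_mtl='^psm'

# To go back to the default component selection:
#   unset OMPI_MCA_mtl
```

With the variable exported, the original crashing command can be rerun unchanged to check whether excluding PSM still avoids the segmentation faults.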

-- 
Jeff Squyres
jsquyres_at_[hidden]