Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Performance scaled messaging and random crashes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-06-29 09:55:31


The Open MPI 1.4 series is now deprecated. Can you upgrade to Open MPI 1.6?
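Before and after upgrading, it may help to confirm which Open MPI build is actually on your path and whether it includes the PSM MTL component. Below is a minimal sketch, not a definitive tool: `check_psm` is a hypothetical helper name, and it assumes the usual `ompi_info` layout, where the version appears on an `Open MPI:` line and components on `MCA mtl: ...` lines. It reads from stdin, so it can also be tried on saved output.

```shell
# Hypothetical helper: summarize `ompi_info` output read from stdin.
# Assumes a version line like "Open MPI: 1.4.3" and component lines
# like "MCA mtl: psm (MCA v2.0, API v2.0, Component v1.4.3)".
check_psm() {
    awk '
        BEGIN          { version = "unknown"; psm = "no" }
        /Open MPI:/    { version = $NF }   # last field holds the version string
        /MCA mtl: psm/ { psm = "yes" }     # PSM MTL component is present
        END            { printf "version=%s psm=%s\n", version, psm }
    '
}

# Typical use on a login node (assumes ompi_info is on PATH):
#   ompi_info | check_psm
```

If this reports `psm=yes` under 1.6 as well, rerunning the failing job there would show whether the crashes are specific to the 1.4 series.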

On Jun 29, 2012, at 9:02 AM, Sébastien Boisvert wrote:

> I am using Open-MPI 1.4.3 compiled with gcc 4.5.3.
>
> The library:
>
> /usr/lib64/libpsm_infinipath.so.1.14: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped
>
>
>
> Jeff Squyres wrote:
>> Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
>>
>> What version of Open MPI are you using?
>>
>>
>> On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
>>
>>> Hello,
>>>
>>> I am getting random crashes (segmentation faults) on a supercomputer (guillimin)
>>> using 3 nodes with 12 cores per node. The same program (Ray) runs without any
>>> problem on the other supercomputers I use.
>>>
>>> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and
>>> the messages transit using "Performance Scaled Messaging" (PSM), which I think is some
>>> sort of replacement for InfiniBand verbs, although I am not sure.
>>>
>>> Adding '--mca mtl ^psm' to the Open-MPI mpiexec program options solves
>>> the problem, but increases the latency from 20 microseconds to 55 microseconds.
>>>
>>> There seems to be some sort of message corruption in transit, but I cannot rule out
>>> other explanations.
>>>
>>>
>>> I have no idea what is going on and why disabling PSM solves the problem.
>>>
>>>
>>> Versions
>>>
>>> module load gcc/4.5.3
>>> module load openmpi/1.4.3-gcc
>>>
>>>
>>> Command that randomly crashes
>>>
>>> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
>>> Ray -k 31 \
>>> -o MiSeq-bug-2012-06-28.1 \
>>> -p \
>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>>
>>>
>>> Command that completes successfully
>>>
>>> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
>>> --mca mtl ^psm \
>>> Ray -k 31 \
>>> -o psm-bug-2012-06-26-hotfix.1 \
>>> -p \
>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>>
>>>
>>>
>>> Sébastien Boisvert
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/