Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Performance scaled messaging and random crashes
From: Sébastien Boisvert (Sebastien.Boisvert.3_at_[hidden])
Date: 2012-06-29 09:02:51


I am using Open MPI 1.4.3 compiled with gcc 4.5.3.

The PSM library installed on the system:

/usr/lib64/libpsm_infinipath.so.1.14: ELF 64-bit LSB shared object, AMD
x86-64, version 1 (SYSV), not stripped
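
For anyone reproducing this, the workaround quoted below ('--mca mtl ^psm') can also be expressed through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<parameter>. The following is only a minimal sketch under that assumption; the ompi_info check is guarded so the script still runs on a machine without Open MPI, and nothing here is specific to guillimin.

```shell
#!/bin/sh
# Sketch: exclude the PSM MTL via the environment instead of the command line.
# Single quotes keep the shell from interpreting the '^' character.
export OMPI_MCA_mtl='^psm'

# If Open MPI is on the PATH, list the MTL components this build provides,
# to confirm whether a psm component is present at all.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep 'MCA mtl' || echo "no MTL components reported"
fi

# Show the setting that a subsequent mpiexec launch would inherit.
echo "OMPI_MCA_mtl=$OMPI_MCA_mtl"
```

With the variable exported, the crashing command can be launched unchanged, which makes it easier to switch PSM on and off between otherwise identical runs.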

Jeff Squyres wrote:
> Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
>
> What version of Open MPI are you using?
>
>
> On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
>
>> Hello,
>>
>> I am getting random crashes (segmentation faults) on a supercomputer (guillimin)
>> using 3 nodes with 12 cores per node. The same program (Ray) runs without any
>> problem on the other supercomputers I use.
>>
>> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and
>> the messages are carried by "performance scaled messaging" (PSM), which I think is some
>> sort of replacement for InfiniBand verbs, although I am not sure.
>>
>> Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves
>> the problem, but increases the latency from 20 microseconds to 55 microseconds.
>>
>> Messages seem to be corrupted in transit, but I cannot rule out
>> other explanations.
>>
>>
>> I have no idea what is going on and why disabling PSM solves the problem.
>>
>>
>> Versions
>>
>> module load gcc/4.5.3
>> module load openmpi/1.4.3-gcc
>>
>>
>> Command that randomly crashes
>>
>> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
>> Ray -k 31 \
>> -o MiSeq-bug-2012-06-28.1 \
>> -p \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>
>>
>> Command that completes successfully
>>
>> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
>> --mca mtl ^psm \
>> Ray -k 31 \
>> -o psm-bug-2012-06-26-hotfix.1 \
>> -p \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>
>>
>>
>> Sébastien Boisvert