Hello,
Here is an update for the list:
Today, I added message integrity verification to my code (RayPlatform) using
the CRC32 algorithm.
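For readers following the thread, here is a minimal sketch of this kind of check. It is an illustration only, not RayPlatform's actual code (which is C++). It assumes the payload is a sequence of unsigned 64-bit units (matching the sizeof(MessageUnit) of 8 reported below) and that the CRC32 travels in one extra trailing unit. The function names are hypothetical.

```python
import struct
import zlib


def crc32_of_units(units):
    """Return the CRC32 of a sequence of unsigned 64-bit message units."""
    # Serialize the units as little-endian 64-bit words (an assumption
    # made for this sketch), then checksum the resulting bytes.
    payload = struct.pack("<%dQ" % len(units), *units)
    return zlib.crc32(payload) & 0xFFFFFFFF


def add_checksum(units):
    """Return a new buffer with the CRC32 stored in one extra trailing unit."""
    return list(units) + [crc32_of_units(units)]


def verify_checksum(buffer):
    """Compare the checksum carried in the buffer with a recomputed one.

    Returns (ok, carried, recomputed).
    """
    payload, carried = buffer[:-1], buffer[-1]
    recomputed = crc32_of_units(payload)
    return carried == recomputed, carried, recomputed


if __name__ == "__main__":
    sent = add_checksum([1000000, 42, 7])
    ok, carried, recomputed = verify_checksum(sent)
    print("intact:", ok, format(carried, "x"), format(recomputed, "x"))

    # Simulate corruption in transit by overwriting the first unit.
    damaged = [0] + sent[1:]
    ok, carried, recomputed = verify_checksum(damaged)
    print("damaged:", ok, format(carried, "x"), format(recomputed, "x"))
```

The sender would call add_checksum() just before sending, and the receiver would call verify_checksum() right after receiving. A mismatch means the bytes changed somewhere between the two calls, but it does not say which layer changed them.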
Without PSM, everything runs fine.
With PSM, I get these errors:

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
Source: 3
Destination: 3
sizeof(MessageUnit): 8
Count (excluding checksum): 1
Expected checksum (CRC32): ea
Actual checksum (CRC32): 4f3b6143

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_GET_READ_MATE
Source: 4
Destination: 4
sizeof(MessageUnit): 8
Count (excluding checksum): 1
Expected checksum (CRC32): f4240
Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
Source: 5
Destination: 5
sizeof(MessageUnit): 8
Count (excluding checksum): 7
Expected checksum (CRC32): dd94edd5

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT
Source: 5
Destination: 5
sizeof(MessageUnit): 8
Count (excluding checksum): 2
Expected checksum (CRC32): e80f2c45
Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT_REPLY
Source: 5
Destination: 5
sizeof(MessageUnit): 8
Count (excluding checksum): 2
Expected checksum (CRC32): 42
Actual checksum (CRC32): a906f61

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE
Source: 12
Destination: 12
sizeof(MessageUnit): 8
Count (excluding checksum): 3
Expected checksum (CRC32): 5b6f1504
Actual checksum (CRC32): d5b3049a

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_READS
Source: 27
Destination: 27
sizeof(MessageUnit): 8
Count (excluding checksum): 5
Expected checksum (CRC32): fc01eda4
Actual checksum (CRC32): 0

I guess this is where the Open-MPI "dr" (data reliability) PML
(point-to-point messaging layer) would be helpful.
I now have an open case with QLogic support.
Thank you for your help.
Jeff Squyres wrote:
> Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
>
> What version of Open MPI are you using?
>
>
> On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
>
>> Hello,
>>
>> I am getting random crashes (segmentation faults) on a supercomputer (guillimin)
>> using 3 nodes with 12 cores per node. The same program (Ray) runs without any
>> problem on the other supercomputers I use.
>>
>> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and
>> the messages transit using "performance scaled messaging" (PSM), which I think is some
>> sort of replacement for InfiniBand verbs, although I am not sure.
>>
>> Adding '--mca mtl ^psm' to the Open-MPI mpiexec program options solves
>> the problem, but increases the latency from 20 microseconds to 55 microseconds.
>>
>> There seems to be some sort of message corruption during transit, but I cannot rule out
>> other explanations.
>>
>>
>> I have no idea what is going on or why disabling PSM solves the problem.
>>
>>
>> Versions
>>
>> module load gcc/4.5.3
>> module load openmpi/1.4.3-gcc
>>
>>
>> Command that randomly crashes
>>
>> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
>> Ray -k 31 \
>> -o MiSeq-bug-2012-06-28.1 \
>> -p \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>
>>
>> Command that completes successfully
>>
>> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
>> --mca mtl ^psm \
>> Ray -k 31 \
>> -o psm-bug-2012-06-26-hotfix.1 \
>> -p \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>
>>
>>
>> Sébastien Boisvert