Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Performance scaled messaging and random crashes
From: Sébastien Boisvert (Sebastien.Boisvert.3_at_[hidden])
Date: 2012-06-30 21:25:26


Hello,

Just to give an update on the list:

Today, I added message integrity verification to my code: each message now
carries a CRC32 checksum that is checked on receipt.
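
For readers who want to try the same kind of check, here is a minimal sketch
of the idea. It is not the actual RayPlatform code. The function names
(computeCrc32, appendChecksum, verifyChecksum) are hypothetical, and it
assumes the checksum travels as one extra 8-byte MessageUnit appended after
the payload.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a message is a sequence of 8-byte units, matching the
// sizeof(MessageUnit) value of 8 reported in the error output.
typedef uint64_t MessageUnit;

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
// Slower than a table-driven version, but short and easy to check.
uint32_t computeCrc32(const void* data, std::size_t length) {
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < length; ++i) {
        crc ^= bytes[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Sender side (hypothetical helper): append the checksum of the payload
// as one extra unit at the end of the message.
void appendChecksum(std::vector<MessageUnit>& message) {
    uint32_t crc = computeCrc32(message.data(),
                                message.size() * sizeof(MessageUnit));
    message.push_back(static_cast<MessageUnit>(crc));
}

// Receiver side (hypothetical helper): recompute the checksum over the
// payload (everything except the last unit) and compare it with the
// value that was carried in the message.
bool verifyChecksum(const std::vector<MessageUnit>& message) {
    if (message.empty()) {
        return false;  // no room for a checksum unit
    }
    std::size_t count = message.size() - 1;  // count excluding checksum
    uint32_t carried = static_cast<uint32_t>(message[count]);
    uint32_t computed = computeCrc32(message.data(),
                                     count * sizeof(MessageUnit));
    return carried == computed;
}
```

The sender would call appendChecksum() just before sending, and the receiver
would call verifyChecksum() right after receiving and report an error when
it returns false.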

Without PSM, everything runs fine.

With PSM, I get these errors:

Error: RayPlatform detected a message corruption !
  Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
  Source: 3
  Destination: 3
  sizeof(MessageUnit): 8
  Count (excluding checksum): 1
  Expected checksum (CRC32): ea
  Actual checksum (CRC32): 4f3b6143

Error: RayPlatform detected a message corruption !
  Tag: RAY_MPI_TAG_GET_READ_MATE
  Source: 4
  Destination: 4
  sizeof(MessageUnit): 8
  Count (excluding checksum): 1
  Expected checksum (CRC32): f4240
  Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
  Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
  Source: 5
  Destination: 5
  sizeof(MessageUnit): 8
  Count (excluding checksum): 7
  Expected checksum (CRC32): dd94edd5

Error: RayPlatform detected a message corruption !
  Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT
  Source: 5
  Destination: 5
  sizeof(MessageUnit): 8
  Count (excluding checksum): 2
  Expected checksum (CRC32): e80f2c45
  Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
  Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT_REPLY
  Source: 5
  Destination: 5
  sizeof(MessageUnit): 8
  Count (excluding checksum): 2
  Expected checksum (CRC32): 42
  Actual checksum (CRC32): a906f61

Error: RayPlatform detected a message corruption !
  Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE
  Source: 12
  Destination: 12
  sizeof(MessageUnit): 8
  Count (excluding checksum): 3
  Expected checksum (CRC32): 5b6f1504
  Actual checksum (CRC32): d5b3049a

Error: RayPlatform detected a message corruption !
  Tag: RAY_MPI_TAG_REQUEST_VERTEX_READS
  Source: 27
  Destination: 27
  sizeof(MessageUnit): 8
  Count (excluding checksum): 5
  Expected checksum (CRC32): fc01eda4
  Actual checksum (CRC32): 0

I guess this is where the Open MPI dr (data reliability) PML (point-to-point
messaging layer) would be helpful.

I now have an open case with QLogic support.

Thank you for your help.

Jeff Squyres wrote:
> Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
>
> What version of Open MPI are you using?
>
>
> On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
>
>> Hello,
>>
>> I am getting random crashes (segmentation faults) on a supercomputer (guillimin)
>> using 3 nodes with 12 cores per node. The same program (Ray) runs without any
>> problem on the other supercomputers I use.
>>
>> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and
>> the messages travel over "performance scaled messaging" (PSM), which I think is some
>> sort of replacement for InfiniBand verbs, although I am not sure.
>>
>> Adding '--mca mtl ^psm' to the Open-MPI mpiexec program options solves
>> the problem, but increases the latency from 20 microseconds to 55 microseconds.
>>
>> There seems to be some sort of message corruption in transit, but I cannot rule out
>> other explanations.
>>
>>
>> I have no idea what is going on and why disabling PSM solves the problem.
>>
>>
>> Versions
>>
>> module load gcc/4.5.3
>> module load openmpi/1.4.3-gcc
>>
>>
>> Command that randomly crashes
>>
>> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
>> Ray -k 31 \
>> -o MiSeq-bug-2012-06-28.1 \
>> -p \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>
>>
>> Command that completes successfully
>>
>> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
>> --mca mtl ^psm \
>> Ray -k 31 \
>> -o psm-bug-2012-06-26-hotfix.1 \
>> -p \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>
>>
>>
>> Sébastien Boisvert
>