Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Performance scaled messaging and random crashes
From: Sébastien Boisvert (Sebastien.Boisvert.3_at_[hidden])
Date: 2012-06-29 13:55:34


Hi,

Thank you for the direction.

I installed Open-MPI 1.6 and the program is also crashing with 1.6.

Could there be a bug in my code?

I don't see how disabling PSM would make the bug go away if the bug
is in my code.
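
Because the crashes are random, a single run says little about either configuration. A minimal sketch for repeating a run and counting failures is given below. It assumes that mpiexec returns a non-zero exit status when a rank crashes. The name count_failures is a hypothetical helper, not part of Open MPI or Ray.

```shell
# Hypothetical helper (not part of Open MPI or Ray): run a command N times
# and print how many of those runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's own output so that only the count is printed.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

Calling it once with the crashing mpiexec command line and once with '--mca mtl ^psm' added would give two failure counts to compare. Each repetition reuses the same output names, so earlier results are overwritten.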

Open-MPI configure command

module load gcc/4.5.3

./configure \
--prefix=/sb/project/nne-790-ab/software/Open-MPI/1.6/Build \
--with-openib \
--with-psm \
--with-tm=/software/tools/torque/ \
| tee configure.log

Versions

module load gcc/4.5.3
module load /sb/project/nne-790-ab/software/modulefiles/mpi/Open-MPI/1.6
module load /sb/project/nne-790-ab/software/modulefiles/apps/ray/2.0.0

PSM parameters

guillimin> ompi_info -a|grep psm
                  MCA mtl: psm (MCA v2.0, API v2.0, Component v1.6)
                  MCA mtl: parameter "mtl_psm_connect_timeout" (current value: <180>, data source: default value)
                  MCA mtl: parameter "mtl_psm_debug" (current value: <1>, data source: default value)
                  MCA mtl: parameter "mtl_psm_ib_unit" (current value: <-1>, data source: default value)
                  MCA mtl: parameter "mtl_psm_ib_port" (current value: <0>, data source: default value)
                  MCA mtl: parameter "mtl_psm_ib_service_level" (current value: <0>, data source: default value)
                  MCA mtl: parameter "mtl_psm_ib_pkey" (current value: <32767>, data source: default value)
                  MCA mtl: parameter "mtl_psm_ib_service_id" (current value: <0x1000117500000000>, data source: default value)
                  MCA mtl: parameter "mtl_psm_path_query" (current value: <none>, data source: default value)
                  MCA mtl: parameter "mtl_psm_priority" (current value: <0>, data source: default value)
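
Parameter listings like this one are easier to compare between installations as name=value pairs. The sketch below condenses such text, whether or not the lines were wrapped by the mail client. It assumes the 'parameter "NAME" (current value: <VALUE>' layout shown above and a grep that supports -o. The name psm_params is hypothetical.

```shell
# Hypothetical helper: read ompi_info-style parameter text on stdin and
# print one name=value pair per parameter.
psm_params() {
    # Join all lines first so that entries wrapped across lines still match.
    tr '\n' ' ' |
        grep -o 'parameter "[^"]*" (current *value: *<[^>]*>' |
        sed -E 's/parameter "([^"]*)" \(current *value: *<([^>]*)>/\1=\2/'
}

# Intended use (requires an Open MPI installation):
#   ompi_info -a | grep psm | psm_params
```

Saving the result on two systems and running diff on the two files shows which PSM settings differ.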

Thank you.

Sébastien Boisvert

Jeff Squyres wrote:
> The Open MPI 1.4 series is now deprecated. Can you upgrade to Open MPI 1.6?
>
>
> On Jun 29, 2012, at 9:02 AM, Sébastien Boisvert wrote:
>
>> I am using Open-MPI 1.4.3 compiled with gcc 4.5.3.
>>
>> The library:
>>
>> /usr/lib64/libpsm_infinipath.so.1.14: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped
>>
>>
>>
>> Jeff Squyres wrote:
>>> Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
>>>
>>> What version of Open MPI are you using?
>>>
>>>
>>> On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
>>>
>>>> Hello,
>>>>
>>>> I am getting random crashes (segmentation faults) on a supercomputer (guillimin)
>>>> using 3 nodes with 12 cores per node. The same program (Ray) runs without any
>>>> problem on the other supercomputers I use.
>>>>
>>>> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and
>>>> the messages travel over "performance scaled messaging" (PSM), which I think is some
>>>> sort of replacement for InfiniBand verbs, although I am not sure.
>>>>
>>>> Adding '--mca mtl ^psm' to the Open-MPI mpiexec program options solves
>>>> the problem, but increases the latency from 20 microseconds to 55 microseconds.
>>>>
>>>> There seems to be some sort of message corruption in transit, but I cannot rule out
>>>> other explanations.
>>>>
>>>>
>>>> I have no idea what is going on and why disabling PSM solves the problem.
>>>>
>>>>
>>>> Versions
>>>>
>>>> module load gcc/4.5.3
>>>> module load openmpi/1.4.3-gcc
>>>>
>>>>
>>>> Command that randomly crashes
>>>>
>>>> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
>>>> Ray -k 31 \
>>>> -o MiSeq-bug-2012-06-28.1 \
>>>> -p \
>>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>>>
>>>>
>>>> Command that completes successfully
>>>>
>>>> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
>>>> --mca mtl ^psm \
>>>> Ray -k 31 \
>>>> -o psm-bug-2012-06-26-hotfix.1 \
>>>> -p \
>>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>>>> data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>>>
>>>>
>>>>
>>>> Sébastien Boisvert
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>