Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Performance scaled messaging and random crashes
From: Sébastien Boisvert (Sebastien.Boisvert.3_at_[hidden])
Date: 2012-06-29 14:21:45


Hello,

The 20-microsecond latency is a round-trip time: a 4000-byte message
going from MPI rank A to MPI rank B and then back to MPI rank A.

For a one-way trip, that is about 10 microseconds.

The one-way latency for 1-byte messages
from MPI rank A to MPI rank B is already below 3 microseconds.
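
For context, these numbers come from a ping-pong style measurement. Below is a minimal sketch of such a test; the iteration count and output format are illustrative assumptions, not the exact benchmark used on guillimin.

/* pingpong.c: minimal MPI ping-pong latency sketch (illustrative only).
 * Rank 0 sends a buffer to rank 1, which echoes it back; the averaged
 * round-trip time divided by two gives the one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int iterations = 10000;   /* assumed iteration count */
    const int bytes = 4000;         /* message size from this thread */
    char *buffer = malloc(bytes);
    int rank, i;
    double start, round_trip_us;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    for (i = 0; i < iterations; i++) {
        if (rank == 0) {
            MPI_Send(buffer, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buffer, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buffer, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buffer, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        round_trip_us = (MPI_Wtime() - start) / iterations * 1e6;
        printf("round trip: %.2f us, one way: %.2f us\n",
               round_trip_us, round_trip_us / 2.0);
    }

    free(buffer);
    MPI_Finalize();
    return 0;
}

To measure the interconnect rather than shared memory, run it with two ranks placed on different nodes.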

I will contact you off-list.

Thank you.

Elken, Tom wrote:
> Hi Sebastien,
>
> The Infinipath / PSM software that was developed by PathScale/QLogic is now part of Intel.
>
> I'll advise you off-list about how to contact our customer support so we can gather information about your software installation and work to resolve your issue.
>
> The 20-microsecond latency you are getting with Open MPI / PSM is still way too high, so there may be a network issue that needs to be solved first.
>
> -Tom
>
>> -----Original Message-----
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
>> Behalf Of Sébastien Boisvert
>> Sent: Friday, June 29, 2012 10:56 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Performance scaled messaging and random crashes
>>
>> Hi,
>>
>> Thank you for the direction.
>>
>> I installed Open-MPI 1.6 and the program is also crashing with 1.6.
>>
>> Could there be a bug in my code?
>>
>> I don't see how disabling PSM would make the bug go away if the bug is in my
>> code.
>>
>>
>> Open-MPI configure command
>>
>> module load gcc/4.5.3
>>
>> ./configure \
>>   --prefix=/sb/project/nne-790-ab/software/Open-MPI/1.6/Build \
>>   --with-openib \
>>   --with-psm \
>>   --with-tm=/software/tools/torque/ \
>>   | tee configure.log
>>
>>
>> Versions
>>
>> module load gcc/4.5.3
>> module load /sb/project/nne-790-ab/software/modulefiles/mpi/Open-MPI/1.6
>> module load /sb/project/nne-790-ab/software/modulefiles/apps/ray/2.0.0
>>
>>
>>
>> PSM parameters
>>
>> guillimin> ompi_info -a|grep psm
>>   MCA mtl: psm (MCA v2.0, API v2.0, Component v1.6)
>>   MCA mtl: parameter "mtl_psm_connect_timeout" (current value: <180>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_debug" (current value: <1>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_ib_unit" (current value: <-1>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_ib_port" (current value: <0>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_ib_service_level" (current value: <0>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_ib_pkey" (current value: <32767>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_ib_service_id" (current value: <0x1000117500000000>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_path_query" (current value: <none>, data source: default value)
>>   MCA mtl: parameter "mtl_psm_priority" (current value: <0>, data source: default value)
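
A side note on the MCA parameters listed above: Open MPI also reads them from environment variables of the form OMPI_MCA_<parameter_name>, so a value such as the PSM connect timeout can be changed without rebuilding anything. The sketch below is illustrative only and assumes the parameter is consumed by the MPI process itself during MPI_Init; the timeout value of 300 is just an example.

/* mca_env_example.c: illustrative sketch only.  Open MPI reads MCA
 * parameters from environment variables named OMPI_MCA_<parameter_name>;
 * setting one before MPI_Init is an alternative to passing --mca on the
 * mpiexec command line. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Raise the PSM connect timeout from the default 180 s shown by
     * ompi_info to 300 s (example value only). */
    setenv("OMPI_MCA_mtl_psm_connect_timeout", "300", 1);

    MPI_Init(&argc, &argv);
    /* ... application code ... */
    MPI_Finalize();
    return 0;
}

The more common form is to pass it on the command line, e.g. mpiexec --mca mtl_psm_connect_timeout 300 ..., the same --mca mechanism used elsewhere in this thread to disable PSM.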
>>
>>
>> Thank you.
>>
>>
>> Sébastien Boisvert
>>
>>
>> Jeff Squyres wrote:
>>> The Open MPI 1.4 series is now deprecated. Can you upgrade to Open MPI 1.6?
>>>
>>> On Jun 29, 2012, at 9:02 AM, Sébastien Boisvert wrote:
>>>
>>>> I am using Open-MPI 1.4.3 compiled with gcc 4.5.3.
>>>>
>>>> The library:
>>>>
>>>> /usr/lib64/libpsm_infinipath.so.1.14: ELF 64-bit LSB shared object,
>>>> AMD x86-64, version 1 (SYSV), not stripped
>>>>
>>>>
>>>>
>>>>> Jeff Squyres wrote:
>>>>> Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
>>>>> What version of Open MPI are you using?
>>>>>
>>>>>
>>>>> On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am getting random crashes (segmentation faults) on a supercomputer
>>>>>> (guillimin) using 3 nodes with 12 cores per node. The same program (Ray)
>>>>>> runs without any problem on the other supercomputers I use.
>>>>>>
>>>>>> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA"
>>>>>> and messages travel over "Performance Scaled Messaging" (PSM), which I
>>>>>> think is some sort of replacement for InfiniBand verbs, although I am not sure.
>>>>>> Adding '--mca mtl ^psm' to the Open-MPI mpiexec program options solves
>>>>>> the problem, but increases the latency from 20 microseconds to 55 microseconds.
>>>>>>
>>>>>> There seems to be some sort of message corruption in transit, but I
>>>>>> cannot rule out other explanations.
>>>>>>
>>>>>>
>>>>>> I have no idea what is going on and why disabling PSM solves the problem.
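
One way to test the message-corruption hypothesis quoted above is to attach a checksum to each message on the send side and verify it on the receive side. The sketch below is illustrative only: the checksum function, tag, and buffer layout are assumptions for the example, not part of Ray or Open MPI.

/* checksum_test.c: illustrative sketch for testing whether payload bytes
 * arrive intact.  Rank 0 appends a simple checksum to a payload before
 * sending; rank 1 recomputes it on receipt and reports any mismatch. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAYLOAD_BYTES 4000   /* message size mentioned in the thread */

static uint32_t checksum(const unsigned char *data, int bytes) {
    uint32_t sum = 0;
    int i;
    for (i = 0; i < bytes; i++)
        sum = sum * 31u + data[i];
    return sum;
}

int main(int argc, char **argv) {
    unsigned char buffer[PAYLOAD_BYTES + sizeof(uint32_t)];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Fill a payload, append its checksum, send both together. */
        uint32_t sum;
        memset(buffer, 0xAB, PAYLOAD_BYTES);
        sum = checksum(buffer, PAYLOAD_BYTES);
        memcpy(buffer + PAYLOAD_BYTES, &sum, sizeof(sum));
        MPI_Send(buffer, (int)sizeof(buffer), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive payload + checksum and verify the bytes arrived intact. */
        uint32_t expected;
        MPI_Recv(buffer, (int)sizeof(buffer), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        memcpy(&expected, buffer + PAYLOAD_BYTES, sizeof(expected));
        if (checksum(buffer, PAYLOAD_BYTES) == expected)
            printf("payload arrived intact\n");
        else
            fprintf(stderr, "checksum mismatch: payload corrupted in transit\n");
    }

    MPI_Finalize();
    return 0;
}

Wrapping an application's send/receive pairs this way would show whether corrupted payloads actually correlate with the crashes or whether the cause lies elsewhere.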
>>>>>>
>>>>>>
>>>>>> Versions
>>>>>>
>>>>>> module load gcc/4.5.3
>>>>>> module load openmpi/1.4.3-gcc
>>>>>>
>>>>>>
>>>>>> Command that randomly crashes
>>>>>>
>>>>>> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
>>>>>>   Ray -k 31 \
>>>>>>   -o MiSeq-bug-2012-06-28.1 \
>>>>>>   -p \
>>>>>>   data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>>>>>>   data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>>>>>
>>>>>>
>>>>>> Command that completes successfully
>>>>>>
>>>>>> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
>>>>>>   --mca mtl ^psm \
>>>>>>   Ray -k 31 \
>>>>>>   -o psm-bug-2012-06-26-hotfix.1 \
>>>>>>   -p \
>>>>>>   data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
>>>>>>   data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sébastien Boisvert