Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Performance scaled messaging and random crashes
From: Elken, Tom (tom.elken_at_[hidden])
Date: 2012-06-29 14:14:01


Hi Sebastien,

The InfiniPath / PSM software that was developed by PathScale/QLogic is now part of Intel.

I'll advise you off-list about how to contact our customer support so we can gather information about your software installation and work to resolve your issue.

The 20-microsecond latency you are getting with Open MPI / PSM is still far too high, so there may be a network issue that needs to be resolved first.
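
If you want to double-check the latency yourself, a point-to-point test between two nodes is a quick sanity check; for example, with the OSU micro-benchmarks (the hostnames below are placeholders):

mpirun -np 2 --host node001,node002 ./osu_latency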

-Tom

> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On
> Behalf Of Sébastien Boisvert
> Sent: Friday, June 29, 2012 10:56 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Performance scaled messaging and random crashes
>
> Hi,
>
> Thank you for the direction.
>
> I installed Open MPI 1.6, and the program also crashes with 1.6.
>
> Could there be a bug in my code?
>
> I don't see how disabling PSM would make the bug go away if the bug is in my
> code.
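>
> One way I could test whether messages get corrupted, outside of Ray, is a
> ping-pong between two ranks that verifies every byte of every message. A
> minimal sketch (message size and iteration count are illustrative, not
> taken from Ray):
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> /* Rank 0 sends a buffer filled with a known pattern; rank 1 verifies
>    every byte and echoes the buffer back. Any mismatch aborts the job. */
> int main(int argc, char **argv)
> {
>     int rank, i, iter;
>     const int n = 4096; /* message size in bytes (illustrative) */
>     unsigned char *buf = malloc(n);
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     for (iter = 0; iter < 100000; iter++) {
>         if (rank == 0) {
>             for (i = 0; i < n; i++)
>                 buf[i] = (unsigned char)((i + iter) & 0xff);
>             MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
>             MPI_Recv(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>         } else if (rank == 1) {
>             MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             for (i = 0; i < n; i++) {
>                 if (buf[i] != (unsigned char)((i + iter) & 0xff)) {
>                     fprintf(stderr, "corruption: iteration %d, byte %d\n",
>                             iter, i);
>                     MPI_Abort(MPI_COMM_WORLD, 1);
>                 }
>             }
>             MPI_Send(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
>         }
>     }
>     if (rank == 0) printf("no corruption detected\n");
>     free(buf);
>     MPI_Finalize();
>     return 0;
> }
>
> Compiled with mpicc and run with -n 2 and one rank per node, this should
> exercise the PSM path without involving any of Ray's own code.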
>
>
> Open MPI configure command
>
> module load gcc/4.5.3
>
> ./configure \
>     --prefix=/sb/project/nne-790-ab/software/Open-MPI/1.6/Build \
>     --with-openib \
>     --with-psm \
>     --with-tm=/software/tools/torque/ \
>     | tee configure.log
>
>
> Versions
>
> module load gcc/4.5.3
> module load /sb/project/nne-790-ab/software/modulefiles/mpi/Open-MPI/1.6
> module load /sb/project/nne-790-ab/software/modulefiles/apps/ray/2.0.0
>
>
>
> PSM parameters
>
> guillimin> ompi_info -a | grep psm
>     MCA mtl: psm (MCA v2.0, API v2.0, Component v1.6)
>     MCA mtl: parameter "mtl_psm_connect_timeout" (current value: <180>, data source: default value)
>     MCA mtl: parameter "mtl_psm_debug" (current value: <1>, data source: default value)
>     MCA mtl: parameter "mtl_psm_ib_unit" (current value: <-1>, data source: default value)
>     MCA mtl: parameter "mtl_psm_ib_port" (current value: <0>, data source: default value)
>     MCA mtl: parameter "mtl_psm_ib_service_level" (current value: <0>, data source: default value)
>     MCA mtl: parameter "mtl_psm_ib_pkey" (current value: <32767>, data source: default value)
>     MCA mtl: parameter "mtl_psm_ib_service_id" (current value: <0x1000117500000000>, data source: default value)
>     MCA mtl: parameter "mtl_psm_path_query" (current value: <none>, data source: default value)
>     MCA mtl: parameter "mtl_psm_priority" (current value: <0>, data source: default value)
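>
> Any of these parameters can be overridden at run time, either on the
> mpiexec command line or through the environment; for example (the value
> is illustrative):
>
> mpiexec --mca mtl_psm_connect_timeout 300 -n 36 Ray ...
>
> or, equivalently:
>
> export OMPI_MCA_mtl_psm_connect_timeout=300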
>
>
> Thank you.
>
>
> Sébastien Boisvert
>
>
> Jeff Squyres wrote:
> > The Open MPI 1.4 series is now deprecated. Can you upgrade to Open MPI 1.6?
> >
> >
> > On Jun 29, 2012, at 9:02 AM, Sébastien Boisvert wrote:
> >
> >> I am using Open-MPI 1.4.3 compiled with gcc 4.5.3.
> >>
> >> The library:
> >>
> >> /usr/lib64/libpsm_infinipath.so.1.14: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped
> >>
> >>
> >>
> >> Jeff Squyres wrote:
> >>> Yes, PSM is the native transport for InfiniPath. It is faster than the InfiniBand verbs support on the same hardware.
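> >>>
> >>> For comparison you can also force one path or the other explicitly
> >>> (commands are illustrative):
> >>>
> >>> mpirun --mca mtl psm -n 36 ./app                              # PSM
> >>> mpirun --mca mtl ^psm --mca btl openib,self,sm -n 36 ./app    # verbs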
> >>>
> >>> What version of Open MPI are you using?
> >>>
> >>>
> >>> On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I am getting random crashes (segmentation faults) on a supercomputer
> >>>> (guillimin) using 3 nodes with 12 cores per node. The same program
> >>>> (Ray) runs without any problem on the other supercomputers I use.
> >>>>
> >>>> The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342
> >>>> QDR HCA", and the messages transit using "Performance Scaled
> >>>> Messaging" (PSM), which I think is some sort of replacement for
> >>>> InfiniBand verbs, although I am not sure.
> >>>>
> >>>> Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves the
> >>>> problem, but increases the latency from 20 microseconds to 55
> >>>> microseconds.
> >>>>
> >>>> There seems to be some sort of message corruption during transit,
> >>>> but I cannot rule out other explanations.
> >>>>
> >>>>
> >>>> I have no idea what is going on or why disabling PSM solves the problem.
> >>>>
> >>>>
> >>>> Versions
> >>>>
> >>>> module load gcc/4.5.3
> >>>> module load openmpi/1.4.3-gcc
> >>>>
> >>>>
> >>>> Command that randomly crashes
> >>>>
> >>>> mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
> >>>>     Ray -k 31 \
> >>>>     -o MiSeq-bug-2012-06-28.1 \
> >>>>     -p \
> >>>>     data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
> >>>>     data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
> >>>>
> >>>>
> >>>> Command that completes successfully
> >>>>
> >>>> mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
> >>>>     --mca mtl ^psm \
> >>>>     Ray -k 31 \
> >>>>     -o psm-bug-2012-06-26-hotfix.1 \
> >>>>     -p \
> >>>>     data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
> >>>>     data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
> >>>>
> >>>>
> >>>>
> >>>> Sébastien Boisvert
> >
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users