Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] r27078 and OMPI build
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2012-08-24 21:47:47


Indeed. Sorry to jump back into the melee so late. I did reproduce the
problem on a second SPARC system, to answer Ralph's earlier question; I
don't know how interesting that is given that it's very similar to the
original system. And, to corroborate Paul's AMD observation, we have an
x86/Solaris/Studio system that is *not* seeing the problem. Thanks to
Paul for identifying the likely cause of the problem.

On 8/24/2012 6:32 PM, Ralph Castain wrote:
> Thanks Paul!! That is very helpful - hopefully the ORNL folks can now
> fix the problem.
>
> On Aug 24, 2012, at 6:29 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> I *can* reproduce the problem on SPARC/Solaris-10 with the SS12.3
>> compiler and an ALMOST vanilla configure:
>> $ [path_to]configure \
>>     --prefix=[blah] CC=cc CXX=CC F77=f77 FC=f90 \
>>     CFLAGS="-m64" --with-wrapper-cflags="-m64" \
>>     CXXFLAGS="-m64" --with-wrapper-cxxflags="-m64" \
>>     FFLAGS="-m64" --with-wrapper-fflags="-m64" \
>>     FCFLAGS="-m64" --with-wrapper-fcflags="-m64" \
>>     CXXFLAGS="-m64 -library=stlport4"
>>
>> I did NOT manage to reproduce on AMD64/Solaris-11, which completed a
>> build w/ VT disabled.
>> Unfortunately I have neither SPARC/Solaris-11 nor
>> AMD64/Solaris-10 readily available to disambiguate the key factor.
>> Hopefully it is enough to know that the problem is reproducible w/o
>> Oracle's massive configure commandline.
>>
>>
>> The build isn't complete, but I can already see that the symbol has
>> "leaked" into libmpi:
>>
>> $ grep -arl mca_coll_ml_memsync_intra BLD/
>> BLD/ompi/mca/bcol/.libs/libmca_bcol.a
>> BLD/ompi/mca/bcol/base/.libs/bcol_base_open.o
>> BLD/ompi/.libs/libmpi.so.0.0.0
>> BLD/ompi/.libs/libmpi.so
>> BLD/ompi/.libs/libmpi.so.0
>>
>> It is referenced by mca_coll_ml_generic_collectives_launcher:
>>
>> $ nm BLD/ompi/.libs/libmpi.so.0.0.0 | grep -B1 mca_coll_ml_memsync_intra
>> 00000000006a6088 t mca_coll_ml_generic_collectives_launcher
>> U mca_coll_ml_memsync_intra
>>
>> This is coming from libmca_bcol.a:
>> $ nm BLD/ompi/mca/bcol/.libs/libmca_bcol.a | grep -B1 mca_coll_ml_memsync_intra
>> 0000000000005248 t mca_coll_ml_generic_collectives_launcher
>> U mca_coll_ml_memsync_intra
>>
>>
>> This appears to be via the following chain of calls within coll_ml.h:
>>
>> mca_coll_ml_generic_collectives_launcher
>> mca_coll_ml_task_completion_processing
>> coll_ml_fragment_completion_processing
>> mca_coll_ml_buffer_recycling
>> mca_coll_ml_memsync_intra
>>
>> All of these are marked "static inline __opal_attribute_always_inline__".
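>>
>> (For reference: as far as I can tell, __opal_attribute_always_inline__
>> is set by configure in the generated opal_config.h to roughly
>>
>>     #define __opal_attribute_always_inline__ __attribute__((__always_inline__))
>>
>> when the compiler advertises support for the attribute, and to nothing
>> otherwise.)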
>>
>> -Paul
>>
>>
>> On Fri, Aug 24, 2012 at 4:55 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>> OK, I have a vanilla configure+make running on both
>> SPARC/Solaris-10 and AMD64/Solaris-11.
>> I am using the 12.3 Oracle compilers in both cases to match the
>> original report.
>> I'll post the results when they complete.
>>
>> In the meantime, I took a quick look at the code and have a
>> pretty reasonable guess as to the cause.
>> Looking at ompi/mca/coll/ml/coll_ml.h I see:
>>
>> 827  int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *module, int bank_index);
>> [...]
>> 996  static inline __opal_attribute_always_inline__
>> 997  int mca_coll_ml_buffer_recycling(mca_coll_ml_collective_operation_progress_t *ml_request)
>> 998  {
>> [...]
>> 1023     rc = mca_coll_ml_memsync_intra(ml_module, ml_memblock->memsync_counter);
>> [...]
>> 1041 }
>>
>> Based on past experience w/ the Sun/Oracle compilers on another
>> project (see
>> http://bugzilla.hcs.ufl.edu/cgi-bin/bugzilla3/show_bug.cgi?id=193 ),
>> I suspect that this static-inline-always function is being emitted
>> by the compiler in every object that includes this header, even when
>> that object never calls it. The call on line 1023 then results in
>> the undefined reference to mca_coll_ml_memsync_intra.
>> Basically, it is not safe for an inline function in a header to
>> call an extern function that isn't available to every object that
>> includes the header, REGARDLESS of whether the object invokes the
>> inline function or not.
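>>
>> A minimal standalone illustration of the failure mode (hypothetical
>> file and function names, not the actual OMPI sources):
>>
>>     /* widget.h */
>>     int do_sync(int bank);          /* extern, defined in some other object */
>>     static inline __attribute__((always_inline))
>>     int recycle(int bank)
>>     {
>>         return do_sync(bank);       /* call into the extern symbol */
>>     }
>>
>>     /* user.c -- includes the header but never calls recycle() */
>>     #include "widget.h"
>>     int unrelated(void) { return 0; }
>>
>> With gcc the unused static inline is simply discarded, and user.o
>> carries no reference to do_sync. A compiler that emits the inline
>> body anyway (as the Studio compilers appear to do here) leaves
>> user.o with an undefined reference to do_sync, which only resolves
>> if the object defining do_sync is linked in as well.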
>>
>> -Paul
>>
>>
>>
>> On Fri, Aug 24, 2012 at 4:40 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>> Oracle uses an abysmally complicated configure line, but
>> nearly all of it is irrelevant to the problem here. For this,
>> I would suggest just doing a vanilla ./configure - if the
>> component gets pulled into libmpi, then we know there is a
>> problem.
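>>
>> Something along these lines should be enough to check (paths and
>> compiler choices here are just from memory, adjust as needed):
>>
>> $ ./configure CC=cc CXX=CC F77=f77 FC=f90 --prefix=/tmp/ompi-test
>> $ make
>> $ nm ompi/.libs/libmpi.so | grep mca_coll_ml
>>
>> Any mca_coll_ml_* hits in libmpi.so would mean the coll/ml symbols
>> are leaking out of the component.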
>>
>> Thanks!
>>
>> Just FYI: here is their actual configure line, in case you spot
>> something problematic:
>>
>> CC=cc CXX=CC F77=f77 FC=f90 --with-openib --enable-openib-connectx-xrc --without-udapl
>> --disable-openib-ibcm --enable-btl-openib-failover --without-dtrace --enable-heterogeneous
>> --enable-cxx-exceptions --enable-shared --enable-orterun-prefix-by-default --with-sge
>> --enable-mpi-f90 --with-mpi-f90-size=small --disable-peruse --disable-state
>> --disable-mpi-thread-multiple --disable-debug --disable-mem-debug --disable-mem-profile
>> CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption
>> cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5" CXXFLAGS="-xtarget=ultra3 -m32
>> -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes
>> -xbuiltin=%all -xO5 -Bstatic -lCrun -lCstd -Bdynamic" FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2
>> -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
>> FCFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption
>> cg -xregs=no%appl -stackvar -xO5"
>> --prefix=/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/installs/JA08/install
>> --mandir=${prefix}/man --bindir=${prefix}/bin --libdir=${prefix}/lib
>> --includedir=${prefix}/include --with-tm=/ws/ompi-tools/orte/torque/current/shared-install32
>> --enable-contrib-no-build=vt --with-package-string="Oracle Message Passing Toolkit "
>> --with-ident-string="@(#)RELEASE VERSION 1.9openmpi-1.5.4-r1.9a1r27092"
>>
>>
>> and the error he gets is:
>>
>> make[2]: Entering directory
>> `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
>> CCLD ompi_info
>> Undefined                        first referenced
>>  symbol                              in file
>> mca_coll_ml_memsync_intra        ../../../ompi/.libs/libmpi.so
>> ld: fatal: symbol referencing errors. No output written to .libs/ompi_info
>> make[2]: *** [ompi_info] Error 2
>> make[2]: Leaving directory
>> `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
>> make[1]: *** [install-recursive] Error 1
>> make[1]: Leaving directory
>> `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi'
>> make: *** [install-recursive] Error 1
>>
>>
>> On Aug 24, 2012, at 4:30 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>>> I have access to a few different Solaris machines and can
>>> offer to build the trunk if somebody tells me what configure
>>> flags are desired.
>>>
>>> -Paul
>>>
>>> On Fri, Aug 24, 2012 at 8:54 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>> Eugene - can you confirm that this is only happening on
>>> the one Solaris system? In other words, is this a
>>> general issue or something specific to that one machine?
>>>
>>> I'm wondering because if it is just the one machine,
>>> then it might be something strange about how it is setup
>>> - perhaps the version of Solaris, or it is configuring
>>> --enable-static, or...
>>>
>>> Just trying to assess how general a problem this might
>>> be, and thus if this should be a blocker or not.
>>>
>>> On Aug 24, 2012, at 8:00 AM, Eugene Loh <eugene.loh_at_[hidden]> wrote:
>>>
>>> > On 08/24/12 09:54, Shamis, Pavel wrote:
>>> >> Maybe there is a chance to get direct access to this system?
>>> > No.
>>> >
>>> > But I'm attaching compressed log files from configure/make.
>>> >
>>> >
>>> > <tarball-of-log-files.tar.bz2>
>>>
>>>
>>> --
>>> Paul H. Hargrove                          PHHargrove_at_[hidden]
>>> Future Technologies Group
>>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>
>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>>
>>
>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department     Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel