I *can* reproduce the problem on SPARC/Solaris-10 with
the SS12.3 compiler and an ALMOST vanilla configure:
$ [path_to]configure \
    --prefix=[blah] CC=cc CXX=CC F77=f77 FC=f90 \
    CFLAGS="-m64" --with-wrapper-cflags="-m64" \
    CXXFLAGS="-m64" --with-wrapper-cxxflags="-m64" \
    FFLAGS="-m64" --with-wrapper-fflags="-m64" \
    FCFLAGS="-m64" --with-wrapper-fcflags="-m64" \
    CXXFLAGS="-m64 -library=stlport4"
I did NOT manage to reproduce on AMD64/Solaris-11,
which completed a build w/ VT disabled.
Unfortunately I have neither SPARC/Solaris-11 nor
AMD64/Solaris-10 readily available to disambiguate the key
factor.
Hopefully it is enough to know that the problem
is reproducible w/o Oracle's massive configure
commandline.
The build isn't complete, but I can already see that
the symbol has "leaked" into libmpi:
$ grep -arl mca_coll_ml_memsync_intra BLD/
BLD/ompi/mca/bcol/.libs/libmca_bcol.a
BLD/ompi/mca/bcol/base/.libs/bcol_base_open.o
BLD/ompi/.libs/libmpi.so.0.0.0
BLD/ompi/.libs/libmpi.so
BLD/ompi/.libs/libmpi.so.0
It is referenced by mca_coll_ml_generic_collectives_launcher:
$ nm BLD/ompi/.libs/libmpi.so.0.0.0 | grep -B1 mca_coll_ml_memsync_intra
00000000006a6088 t mca_coll_ml_generic_collectives_launcher
                 U mca_coll_ml_memsync_intra
This is coming from libmca_bcol.a:
$ nm BLD/ompi/mca/bcol/.libs/libmca_bcol.a | grep -B1 mca_coll_ml_memsync_intra
0000000000005248 t mca_coll_ml_generic_collectives_launcher
                 U mca_coll_ml_memsync_intra
This appears to happen via the following chain of calls
within coll_ml.h:
  mca_coll_ml_generic_collectives_launcher
    mca_coll_ml_task_completion_processing
      coll_ml_fragment_completion_processing
        mca_coll_ml_buffer_recycling
          mca_coll_ml_memsync_intra
All of the callers in this chain are marked as
"static inline __opal_attribute_always_inline__".
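To make the shape of that chain concrete, here is a minimal sketch in C. All demo_* names are hypothetical stand-ins, plain "static inline" is used in place of __opal_attribute_always_inline__, and the extern function is defined in the same file only so the sketch links on its own. It is an illustration of the pattern, not the actual coll_ml.h code.

```c
#include <assert.h>

/* Hypothetical stand-in for the extern declaration in the header. */
int demo_memsync(int bank_index);

/* Innermost inline helper: the only place that names the extern symbol. */
static inline int demo_buffer_recycling(int bank_index)
{
    return demo_memsync(bank_index);
}

static inline int demo_fragment_completion(int bank_index)
{
    return demo_buffer_recycling(bank_index);
}

static inline int demo_task_completion(int bank_index)
{
    return demo_fragment_completion(bank_index);
}

/* Outermost inline helper: once everything below it is inlined, the
 * reference to demo_memsync ends up inside this function's body. */
static inline int demo_launcher(int bank_index)
{
    return demo_task_completion(bank_index);
}

/* Assumption for the sketch: defined here so the file links by itself.
 * In the situation described above, the real definition lives in a
 * component that is not necessarily linked into the same library. */
int demo_memsync(int bank_index)
{
    assert(bank_index >= 0);
    return bank_index + 1;
}
```

If a compiler emits demo_launcher into an object and the definition of demo_memsync is not linked in, the result would be an undefined symbol attributed to the launcher, which matches the nm output above.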
-Paul
On Fri, Aug 24, 2012 at 4:55 PM, Paul Hargrove <phhargrove@lbl.gov> wrote:
OK, I have a vanilla configure+make
running on both SPARC/Solaris-10 and AMD64/Solaris-11.
I am using the 12.3 Oracle compilers in both cases
to match the original report.
I'll post the results when they complete.
In the meantime, I took a quick look at the code
and have a pretty reasonable guess as to the cause.
Looking at ompi/mca/coll/ml/coll_ml.h I see:
 827 int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *module, int bank_index);
[...]
 996 static inline __opal_attribute_always_inline__
 997 int mca_coll_ml_buffer_recycling(mca_coll_ml_collective_operation_progress_t *ml_request)
 998 {
[...]
1023     rc = mca_coll_ml_memsync_intra(ml_module, ml_memblock->memsync_counter);
[...]
1041 }
Based on past experience w/ the Sun/Oracle compilers on another
project (see http://bugzilla.hcs.ufl.edu/cgi-bin/bugzilla3/show_bug.cgi?id=193 ),
I suspect that this static-inline-always function is being emitted
by the compiler in every object that includes this header, even if
the object never calls it. The call on line 1023 then results in the
undefined reference to mca_coll_ml_memsync_intra. Basically, it is
not safe for an inline function in a header to call an extern
function that isn't available to every object that includes the
header, REGARDLESS of whether the object invokes the inline function
or not.
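For illustration only, one way a header can avoid carrying a hard reference to an extern function is to have the inline helper receive the callee as a function pointer. The sketch below uses hypothetical sketch_* names and is not the Open MPI API, nor necessarily the fix that should be applied here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook type; an assumption made for this sketch. */
typedef int (*sketch_memsync_fn)(int bank_index);

/* "Header" part: the inline helper names no extern symbol, so an object
 * that merely includes it has nothing left undefined even if the
 * compiler emits the helper unconditionally. */
static inline int sketch_buffer_recycling(sketch_memsync_fn memsync,
                                          int bank_index)
{
    assert(memsync != NULL);
    return memsync(bank_index);
}

/* "Component" part: the only object that names the real function. */
static int sketch_memsync_intra(int bank_index)
{
    return bank_index * 2;
}

int sketch_component_recycle(int bank_index)
{
    return sketch_buffer_recycling(sketch_memsync_intra, bank_index);
}
```

The trade-off is an indirect call instead of a direct one. Moving the inline helper out of the shared header and into the component's own sources would be another option.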
-Paul
On Fri, Aug 24, 2012 at 4:40 PM, Ralph Castain <rhc@open-mpi.org> wrote:
Oracle uses
an abysmally complicated configure line, but
nearly all of it is irrelevant to the problem
here. For this, I would suggest just doing a
vanilla ./configure - if the component gets
pulled into libmpi, then we know there is a
problem.
Thanks!
Just FYI: here is their actual configure line, just in case you
spot something problematic:
CC=cc CXX=CC F77=f77 FC=f90 --with-openib --enable-openib-connectx-xrc --without-udapl
--disable-openib-ibcm --enable-btl-openib-failover --without-dtrace --enable-heterogeneous
--enable-cxx-exceptions --enable-shared --enable-orterun-prefix-by-default --with-sge
--enable-mpi-f90 --with-mpi-f90-size=small --disable-peruse --disable-state
--disable-mpi-thread-multiple --disable-debug --disable-mem-debug --disable-mem-profile
CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption
cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5" CXXFLAGS="-xtarget=ultra3 -m32
-xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes
-xbuiltin=%all -xO5 -Bstatic -lCrun -lCstd -Bdynamic" FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2
-xprefetch -xprefetch_level=2 -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
FCFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2 -xvector=lib -Qoption
cg -xregs=no%appl -stackvar -xO5"
--prefix=/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/installs/JA08/install
--mandir=${prefix}/man --bindir=${prefix}/bin --libdir=${prefix}/lib
--includedir=${prefix}/include --with-tm=/ws/ompi-tools/orte/torque/current/shared-install32
--enable-contrib-no-build=vt --with-package-string="Oracle Message Passing Toolkit "
--with-ident-string="@(#)RELEASE VERSION 1.9openmpi-1.5.4-r1.9a1r27092"
and the error he gets is:
make[2]: Entering directory
`/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
CCLD ompi_info
Undefined                       first referenced
 symbol                             in file
mca_coll_ml_memsync_intra           ../../../ompi/.libs/libmpi.so
ld: fatal: symbol referencing errors. No output written to .libs/ompi_info
make[2]: *** [ompi_info] Error 2
make[2]: Leaving directory
`/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory
`/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi'
make: *** [install-recursive] Error 1
I have access to a
few different Solaris machines and can
offer to build the trunk if somebody
tells me what configure flags are
desired.
-Paul
On Fri, Aug 24, 2012 at 8:54 AM, Ralph Castain <rhc@open-mpi.org> wrote:
Eugene - can you confirm that this is only happening on the one
Solaris system? In other words, is this a general issue or something
specific to that one machine?
I'm wondering because if it is just the one machine, then it might
be something strange about how it is set up - perhaps the version of
Solaris, or it is configuring --enable-static, or...
Just trying to assess how general a problem this might be, and thus
whether this should be a blocker or not.
On Aug 24, 2012, at 8:00 AM, Eugene Loh <eugene.loh@oracle.com> wrote:
> On 08/24/12 09:54, Shamis, Pavel wrote:
>> Maybe there is a chance to get direct access to this system?
> No.
>
> But I'm attaching compressed log files from configure/make.
>
> <tarball-of-log-files.tar.bz2>
> _______________________________________________
> devel mailing list
> devel@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900