Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Scalability issue
From: Benjamin Toueg (btoueg_at_[hidden])
Date: 2010-12-05 19:17:51


Unfortunately DRAGON is old FORTRAN77, and integers have been used instead of
pointers. If I compile it in 64 bits without -fdefault-integer-8, the
so-called pointers will remain 32 bits wide. Problems could also arise from its
data structure handlers.

Therefore -fdefault-integer-8 is absolutely necessary.
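
To illustrate why, here is a minimal Python sketch (not DRAGON code) of what
happens to an address that is kept in a 32-bit integer. The value used is the
"Failing at address" from the trace quoted below, taken purely as an example of
an address wider than 32 bits; the helper name store_in_int32 is hypothetical.

```python
def store_in_int32(address):
    """Return what is left of `address` after keeping only a signed 32-bit value."""
    low = address & 0xFFFFFFFF          # only the low 32 bits fit
    return low - 2**32 if low >= 2**31 else low

# Example value only: the failing address reported in the quoted trace.
address = 0x2c2579fc0
stored = store_in_int32(address)

print("address needs", address.bit_length(), "bits")
print("value kept in a 32-bit integer:", stored)
print("address survives the round trip:", stored == address)
```

Any address above 2**32 loses its upper bits in this way, which is why the
integers standing in for pointers need to be 8 bytes wide in a 64-bit build.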

Furthermore, MPI_SEND and MPI_RECV are called a dozen times in a single
source file (used for passing a data structure from one node to another), and
this has proved to work in every situation.
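
As a sanity check on the count concern raised further down in this thread, the
following Python sketch reproduces the arithmetic of the 32768 x 65536 example
and compares it with the largest count reported here (2000). It only models
32-bit two's-complement wraparound; it makes no claim about how Open MPI
actually handles the argument, and the helper name as_int32 is hypothetical.

```python
INT32_MAX = 2**31 - 1

def as_int32(value):
    """Interpret the low 32 bits of `value` as a signed two's-complement integer."""
    low = value & 0xFFFFFFFF
    return low - 2**32 if low > INT32_MAX else low

# Example from the thread: a 32768 x 65536 integer buffer.
i = 32768
i2 = 2 * i
count = i * i2
print("thread example:", count, "is seen as", as_int32(count))

# Largest count reported for the MPI_SEND calls in this thread.
reported_count = 2000
print("reported count:", reported_count, "is seen as", as_int32(reported_count))
```

A count of 2000 is far below the wraparound point, so the overflow scenario
would only apply if some count were in fact much larger than reported.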

Not knowing which line is causing my segfault is annoying.

Regards,
Benjamin

2010/12/6 Gustavo Correa <gus_at_[hidden]>

> Hi Benjamin
>
> I would just rebuild OpenMPI withOUT the compiler flags that change the
> standard sizes of "int" and "float" (do a "make distclean" first!), then
> recompile your program, and see how it goes.
> I don't think you are gaining anything by trying to change the standard
> "int/integer" and "real/float" sizes, and most likely they are inviting
> trouble, making things more confusing.
> Worst scenario, you will at least be sure that the bug is somewhere else,
> not in the mismatch of basic type sizes.
>
> If you need to pass 8-byte real buffers, use MPI_DOUBLE_PRECISION, or
> MPI_REAL8 in your (Fortran) MPI calls, and declare them in the Fortran code
> accordingly (double precision or real(kind=8)).
>
> If I remember right, there is no 8-byte integer support in the Fortran MPI
> bindings, only in the C bindings, but some OpenMPI expert could clarify this.
> Hence, if you are passing 8-byte integers in your MPI calls this may
> also be problematic.
>
> My two cents,
> Gus Correa
>
> On Dec 5, 2010, at 3:04 PM, Benjamin Toueg wrote:
>
> > Hi,
> >
> > First of all, thanks for your insight!
> >
> > > Do you get a corefile?
> > I don't get a core file, but I get a file called _FIL001. It doesn't
> > contain any debugging symbols. It's most likely a digested version of the
> > input file given to the executable: ./myexec < inputfile.
> >
> > > there's no line numbers printed in the stack trace
> > I would love to see those, but even if I compile openmpi with -debug
> > -mem-debug -mem-profile, they don't show up. I recompiled my sources to be
> > sure to properly link them to the newly debugged version of openmpi. I
> > assumed I didn't need to compile my own sources with the -g option since it
> > crashes in openmpi itself? I didn't try to run mpiexec via gdb either; I
> > guess it won't help since I already get the trace.
> >
> > > the -fdefault-integer-8 options ought to be highly dangerous
> > Thanks for noting. Indeed I had some issues with this option. For
> > instance I have to declare some arguments as INTEGER*4, like RANK, SIZE,
> > IERR in:
> > CALL MPI_COMM_RANK(MPI_COMM_WORLD,RANK,IERR)
> > CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZE,IERR)
> > In your example "call MPI_Send(buf, count, MPI_INTEGER, dest, tag,
> > MPI_COMM_WORLD, mpierr)" I checked that count is never bigger than 2000 (as
> > you mentioned it could flip to the negative). However I haven't declared it
> > as INTEGER*4 and I think I should.
> > When I said "I had to raise the number of data structures to be sent", I
> > meant that I had to call MPI_SEND many more times, not that buffers were
> > bigger than before.
> >
> > I'll get back to you with more info when I am able to fix my connection
> > problem to the cluster...
> >
> > Thanks,
> > Benjamin
> >
> > 2010/12/3 Martin Siegert <siegert_at_[hidden]>
> > Hi All,
> >
> > just to expand on this guess ...
> >
> > On Thu, Dec 02, 2010 at 05:40:53PM -0500, Gus Correa wrote:
> > > Hi All
> > >
> > > I wonder if configuring OpenMPI while
> > > forcing the default types to non-default values
> > > (-fdefault-integer-8 -fdefault-real-8) might have
> > > something to do with the segmentation fault.
> > > Would this be effective, i.e., actually make
> > > the sizes of MPI_INTEGER/MPI_INT and MPI_REAL/MPI_FLOAT bigger,
> > > or just elusive?
> >
> > I believe what happens is that this mostly affects the fortran
> > wrapper routines and the way Fortran variables are mapped to C:
> >
> > MPI_INTEGER -> MPI_LONG
> > MPI_FLOAT -> MPI_DOUBLE
> > MPI_DOUBLE_PRECISION -> MPI_DOUBLE
> >
> > In that respect I believe that the -fdefault-real-8 option is harmless,
> > i.e., it does the expected thing.
> > But the -fdefault-integer-8 options ought to be highly dangerous:
> > It works for integer variables that are used as "buffer" arguments
> > in MPI statements, but I would assume that this does not work for
> > "count" and similar arguments.
> > Example:
> >
> > integer, allocatable :: buf(:,:)
> > integer i, i2, count, dest, tag, mpierr
> >
> > i = 32768
> > i2 = 2*i
> > allocate(buf(i,i2))
> > count = i*i2
> > buf = 1
> > call MPI_Send(buf, count, MPI_INTEGER, dest, tag, MPI_COMM_WORLD, mpierr)
> >
> > Now count is 2^31 which overflows a 32bit integer.
> > The MPI standard requires that count is a 32bit integer, correct?
> > Thus while buf gets the type MPI_LONG, count remains an int.
> > Is this interpretation correct? If it is, then you are calling
> > MPI_Send with a count argument of -2147483648.
> > Which could result in a segmentation fault.
> >
> > Cheers,
> > Martin
> >
> > --
> > Martin Siegert
> > Head, Research Computing
> > WestGrid/ComputeCanada Site Lead
> > IT Services phone: 778 782-4691
> > Simon Fraser University fax: 778 782-4242
> > Burnaby, British Columbia email: siegert_at_[hidden]
> > Canada V5A 1S6
> >
> > > There were some recent discussions here about MPI
> > > limiting counts to MPI_INTEGER.
> > > Since Benjamin said he "had to raise the number of data structures",
> > > which eventually led to the error,
> > > I wonder if he is inadvertently flipping to negative integer
> > > side of the 32-bit universe (i.e. >= 2**31), as was reported here by
> > > other list subscribers a few times.
> > >
> > > Anyway, segmentation fault can come from many different places,
> > > this is just a guess.
> > >
> > > Gus Correa
> > >
> > > Jeff Squyres wrote:
> > > >Do you get a corefile?
> > > >
> > > >It looks like you're calling MPI_RECV in Fortran and then it segv's.
> > > >This is *likely* because you're either passing a bad parameter or your
> > > >buffer isn't big enough. Can you double check all your parameters?
> > > >
> > > >Unfortunately, there's no line numbers printed in the stack trace, so
> > > >it's not possible to tell exactly where in the ob1 PML it's dying (i.e., so
> > > >we can't see exactly what it's doing to cause the segv).
> > > >
> > > >
> > > >
> > > >On Dec 2, 2010, at 9:36 AM, Benjamin Toueg wrote:
> > > >
> > > >>Hi,
> > > >>
> > > >>I am using DRAGON, a neutronic simulation code in FORTRAN77 that has
> > > >>its own data structures. I added a module to send these data structures
> > > >>using MPI_SEND / MPI_RECV, and everything worked perfectly for a
> > > >>while.
> > > >>
> > > >>Then I had to raise the number of data structures to be sent, up to a
> > > >>point where my cluster has this bug:
> > > >>*** Process received signal ***
> > > >>Signal: Segmentation fault (11)
> > > >>Signal code: Address not mapped (1)
> > > >>Failing at address: 0x2c2579fc0
> > > >>[ 0] /lib/libpthread.so.0 [0x7f52d2930410]
> > > >>[ 1] /home/toueg/openmpi/lib/openmpi/mca_pml_ob1.so [0x7f52d153fe03]
> > > >>[ 2] /home/toueg/openmpi/lib/libmpi.so.0(PMPI_Recv+0x2d2) [0x7f52d3504a1e]
> > > >>[ 3] /home/toueg/openmpi/lib/libmpi_f77.so.0(pmpi_recv_+0x10e) [0x7f52d36cf9c6]
> > > >>
> > > >>How can I make this error more explicit?
> > > >>
> > > >>I use the following configuration of openmpi-1.4.3:
> > > >>./configure --enable-debug --prefix=/home/toueg/openmpi CXX=g++
> > > >>CC=gcc F77=gfortran FC=gfortran FLAGS="-m64 -fdefault-integer-8
> > > >>-fdefault-real-8 -fdefault-double-8" FCFLAGS="-m64 -fdefault-integer-8
> > > >>-fdefault-real-8 -fdefault-double-8" --disable-mpi-f90
> > > >>
> > > >>Here is the output of mpif77 -v:
> > > >>mpif77 for 1.2.7 (release) of : 2005/11/04 11:54:51
> > > >>Driving: f77 -L/usr/lib/mpich-mpd/lib -v -lmpich-p4mpd -lpthread -lrt
> > > >>-lfrtbegin -lg2c -lm -shared-libgcc
> > > >>Reading specs from /usr/lib/gcc/x86_64-linux-gnu/3.4.6/specs
> > > >>Configured with: ../src/configure -v
> > > >>--enable-languages=c,c++,f77,pascal --prefix=/usr --libexecdir=/usr/lib
> > > >>--with-gxx-include-dir=/usr/include/c++/3.4 --enable-shared
> > > >>--with-system-zlib --enable-nls --without-included-gettext
> > > >>--program-suffix=-3.4 --enable-__cxa_atexit --enable-clocale=gnu
> > > >>--enable-libstdcxx-debug x86_64-linux-gnu
> > > >>Thread model: posix
> > > >>gcc version 3.4.6 (Debian 3.4.6-5)
> > > >> /usr/lib/gcc/x86_64-linux-gnu/3.4.6/collect2 --eh-frame-hdr -m
> > > >>elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2
> > > >>/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crt1.o
> > > >>/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crti.o
> > > >>/usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtbegin.o -L/usr/lib/mpich-mpd/lib
> > > >>-L/usr/lib/gcc/x86_64-linux-gnu/3.4.6 -L/usr/lib/gcc/x86_64-linux-gnu/3.4.6
> > > >>-L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib
> > > >>-L/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../.. -L/lib/../lib
> > > >>-L/usr/lib/../lib -lmpich-p4mpd -lpthread -lrt -lfrtbegin -lg2c -lm -lgcc_s
> > > >>-lgcc -lc -lgcc_s -lgcc /usr/lib/gcc/x86_64-linux-gnu/3.4.6/crtend.o
> > > >>/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/crtn.o
> > > >>/usr/lib/gcc/x86_64-linux-gnu/3.4.6/../../../../lib/libfrtbegin.a(frtbegin.o):
> > > >>In function `main':
> > > >>(.text+0x1e): undefined reference to `MAIN__'
> > > >>collect2: ld returned 1 exit status
> > > >>
> > > >>Thanks,
> > > >>Benjamin
> > > >>
> > > >>_______________________________________________
> > > >>users mailing list
> > > >>users_at_[hidden]
> > > >>http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > >
> > > >
> > >



