
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault with HPCC benchmark
From: Gus Correa (gus_at_[hidden])
Date: 2013-04-03 15:24:04


Hi Reza

It is hard to guess with so little information.
Other things you could check:

1) Are you allowed to increase the stack size (say,
has the sys admin capped it in limits.conf)?
If you are using a job queue system,
does it limit the stack size somehow?
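
For example (bash assumed; the limits.conf path is the usual one on
Linux, but your system may differ), you can inspect the limits that
actually apply to your shell:

```shell
# Current soft stack limit for this shell (in KB, or "unlimited")
ulimit -s
# Hard limit: the ceiling a non-root user can raise the soft limit to
ulimit -H -s
# See whether the admin caps the stack system-wide
grep -v '^#' /etc/security/limits.conf 2>/dev/null | grep stack
```

If the hard limit is small, only the sys admin can raise it, and a
per-shell "ulimit -s unlimited" will fail.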

2) If you can compile and
run the Open MPI examples (hello_c.c, ring_c.c, connectivity_c.c),
then it is unlikely that the problem is with Open MPI.
This is kind of a first line of defense to diagnose this type
of problem and the health of your Open MPI installation.

Your error message says "Connection reset by peer", so
I wonder if there is a firewall or other network roadblock
or configuration issue. It is worth testing Open MPI
with simpler MPI programs,
and even (for the network setup) with shell commands like "hostname".
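
For instance (assuming the examples/ directory from the Open MPI
source tree and the same hostfile you use for HPCC), something like:

```shell
# Build and run the stock Open MPI test programs
mpicc examples/hello_c.c -o hello_c
mpicc examples/ring_c.c  -o ring_c
mpirun -np 2 --hostfile ./myhosts ./hello_c
mpirun -np 2 --hostfile ./myhosts ./ring_c

# Pure launch/network test: no MPI calls in the program at all
mpirun -np 2 --hostfile ./myhosts hostname
```

If "hostname" runs on both nodes but ring_c hangs or dies, the
problem is in the TCP communication path rather than process launch.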

3) Make sure there is no mixup of MPI implementations (e.g. MPICH
and Open MPI) or versions, both for mpicc and mpiexec.
Make sure LD_LIBRARY_PATH points to the right Open MPI
lib location (and to the right BLAS/LAPACK location, for that matter).
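
A quick fingerprint test (the --showme flag is specific to Open MPI's
compiler wrapper; MPICH's wrapper will reject it):

```shell
# Do mpicc and mpiexec come from the same installation?
which mpicc
which mpiexec
# Open MPI-specific flag: errors out if mpicc is really MPICH's wrapper
mpicc --showme:version
# Library search path the runtime linker will use
echo "$LD_LIBRARY_PATH"
```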

4) No mixup of architectures either (32 vs 64 bit).
I wonder why your Open MPI library is installed in
/usr/lib/openmpi not /usr/lib64,
but your HPL ARCH = intel64 and everything else seems to be x86_64.
If you apt-get an Open MPI package, check if it is
i386 or x86_64.
(It may be simpler to download and install
the Open MPI tarball in /usr/local or in your home directory.)
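
A quick way to spot a 32/64-bit mismatch (the libmpi.so path is taken
from your make file; adjust as needed):

```shell
# Machine architecture: expect x86_64 on a 64-bit box
uname -m
# Architecture of the MPI library named in your make file
file /usr/lib/openmpi/lib/libmpi.so
# On Ubuntu, which flavor of the package is installed?
dpkg -l | grep -i openmpi
```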

5) Check whether you are using a threaded or OpenMP-enabled
BLAS/LAPACK library, or running with more than one thread.
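
A quick way to take threading out of the picture (OMP_NUM_THREADS is
what OpenMP-enabled BLAS builds usually honor; other libraries may use
their own variable):

```shell
# Force single-threaded BLAS for one test run
export OMP_NUM_THREADS=1
mpirun -np 2 --hostfile ./myhosts hpcc

# A threaded BLAS usually shows up in the link as pthread/gomp
ldd ./hpcc | grep -E 'pthread|gomp'
```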

6) Is the problem size (N) in your HPL.dat parameter file
consistent with the physical memory available?
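
The usual rule of thumb: the double-precision matrix takes 8*N^2 bytes
across all ranks, and you want that under roughly 80% of the aggregate
memory to avoid swapping. A rough check (Python; the 8 GiB figure
below is only an example, not your actual setup):

```python
import math

def max_hpl_n(total_mem_bytes, fraction=0.8):
    """Largest N whose 8*N^2-byte HPL matrix fits in the given memory fraction."""
    return int(math.sqrt(fraction * total_mem_bytes / 8))

# Example: two instances with 4 GiB each -> 8 GiB aggregate
print(max_hpl_n(8 * 1024**3))  # 29308
```

If N in HPL.dat is far above this, ranks start swapping or get
OOM-killed, which can look like random crashes.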

I hope this helps,
Gus Correa

On 04/03/2013 02:32 PM, Ralph Castain wrote:
> I agree with Gus - check your stack size. This isn't occurring in OMPI
> itself, so I suspect it is in the system setup.
>
>
> On Apr 3, 2013, at 10:17 AM, Reza Bakhshayeshi <reza.b2008_at_[hidden]> wrote:
>
>> Thanks for your answers.
>>
>> @Ralph Castain:
>> Do you mean the error I receive?
>> This is the output when I run the program:
>>
>> *** Process received signal ***
>> Signal: Segmentation fault (11)
>> Signal code: Address not mapped (1)
>> Failing at address: 0x1b7f000
>> [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f6a84b524a0]
>> [ 1] hpcc(HPCC_Power2NodesMPIRandomAccessCheck+0xa04) [0x423834]
>> [ 2] hpcc(HPCC_MPIRandomAccess+0x87a) [0x41e43a]
>> [ 3] hpcc(main+0xfbf) [0x40a1bf]
>> [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
>> [0x7f6a84b3d76d]
>> [ 5] hpcc() [0x40aafd]
>> *** End of error message ***
>> [][[53938,1],0][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 4164 on node 192.168.100.6
>> exited on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> @Gus Correa:
>> I did it both on the server and on the instances, but it didn't solve the problem.
>>
>>
>> On 3 April 2013 19:14, Gus Correa <gus_at_[hidden]> wrote:
>>
>> Hi Reza
>>
>> Check the system stacksize first ('limit stacksize' or 'ulimit -s').
>> If it is small, you can try to increase it
>> before you run the program.
>> Say (tcsh):
>>
>> limit stacksize unlimited
>>
>> or (bash):
>>
>> ulimit -s unlimited
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>> On 04/03/2013 10:29 AM, Ralph Castain wrote:
>>
>> Could you perhaps share the stacktrace from the segfault? It's
>> impossible to advise you on the problem without seeing it.
>>
>>
>> On Apr 3, 2013, at 5:28 AM, Reza Bakhshayeshi
>> <reza.b2008_at_[hidden]> wrote:
>>
>> Hi
>> I have installed the HPCC benchmark suite and Open MPI on
>> private cloud instances.
>> Unfortunately, I mostly get a segmentation fault when I run it
>> simultaneously on two or more instances with:
>> mpirun -np 2 --hostfile ./myhosts hpcc
>>
>> Everything is on Ubuntu Server 12.04 (updated),
>> and this is my make.intel64 file:
>>
>> # ----------------------------------------------------------------------
>> # - shell ----------------------------------------------------------------
>> # ----------------------------------------------------------------------
>> #
>> SHELL = /bin/sh
>> #
>> CD = cd
>> CP = cp
>> LN_S = ln -s
>> MKDIR = mkdir
>> RM = /bin/rm -f
>> TOUCH = touch
>> #
>> # ----------------------------------------------------------------------
>> # - Platform identifier --------------------------------------------------
>> # ----------------------------------------------------------------------
>> #
>> ARCH = intel64
>> #
>> # ----------------------------------------------------------------------
>> # - HPL Directory Structure / HPL library ------------------------------
>> # ----------------------------------------------------------------------
>> #
>> TOPdir = ../../..
>> INCdir = $(TOPdir)/include
>> BINdir = $(TOPdir)/bin/$(ARCH)
>> LIBdir = $(TOPdir)/lib/$(ARCH)
>> #
>> HPLlib = $(LIBdir)/libhpl.a
>> #
>> # ----------------------------------------------------------------------
>> # - Message Passing library (MPI) --------------------------------------
>> # ----------------------------------------------------------------------
>> # MPinc tells the C compiler where to find the Message Passing library
>> # header files, MPlib is defined to be the name of the library to be
>> # used. The variable MPdir is only used for defining MPinc and MPlib.
>> #
>> MPdir = /usr/lib/openmpi
>> MPinc = -I$(MPdir)/include
>> MPlib = $(MPdir)/lib/libmpi.so
>> #
>> # ----------------------------------------------------------------------
>> # - Linear Algebra library (BLAS or VSIPL) -----------------------------
>> # ----------------------------------------------------------------------
>> # LAinc tells the C compiler where to find the Linear Algebra library
>> # header files, LAlib is defined to be the name of the library to be
>> # used. The variable LAdir is only used for defining LAinc and LAlib.
>> #
>> LAdir = /usr/local/ATLAS/obj64
>> LAinc = -I$(LAdir)/include
>> LAlib = $(LAdir)/lib/libcblas.a $(LAdir)/lib/libatlas.a
>> #
>> # ----------------------------------------------------------------------
>> # - F77 / C interface ---------------------------------------------------
>> # ----------------------------------------------------------------------
>> # You can skip this section if and only if you are not planning to use
>> # a BLAS library featuring a Fortran 77 interface. Otherwise, it is
>> # necessary to fill out the F2CDEFS variable with the appropriate
>> # options. **One and only one** option should be chosen in **each** of
>> # the 3 following categories:
>> #
>> # 1) name space (How C calls a Fortran 77 routine)
>> #
>> # -DAdd_          : all lower case and a suffixed underscore (Suns,
>> #                   Intel, ...), [default]
>> # -DNoChange      : all lower case (IBM RS6000),
>> # -DUpCase        : all upper case (Cray),
>> # -DAdd__         : the FORTRAN compiler in use is f2c.
>> #
>> # 2) C and Fortran 77 integer mapping
>> #
>> # -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,   [default]
>> # -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
>> # -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
>> #
>> # 3) Fortran 77 string handling
>> #
>> # -DStringSunStyle  : The string address is passed at the string loca-
>> #                     tion on the stack, and the string length is then
>> #                     passed as an F77_INTEGER after all explicit
>> #                     stack arguments, [default]
>> # -DStringStructPtr : The address of a structure is passed by a
>> #                     Fortran 77 string, and the structure is of the
>> #                     form: struct {char *cp; F77_INTEGER len;},
>> # -DStringStructVal : A structure is passed by value for each Fortran
>> #                     77 string, and the structure is of the form:
>> #                     struct {char *cp; F77_INTEGER len;},
>> # -DStringCrayStyle : Special option for Cray machines, which uses
>> #                     Cray fcd (fortran character descriptor) for
>> #                     interoperation.
>> #
>> F2CDEFS =
>> #
>> # ----------------------------------------------------------------------
>> # - HPL includes / libraries / specifics --------------------------------
>> # ----------------------------------------------------------------------
>> #
>> HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
>> HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib) -lm
>> #
>> # - Compile time options ------------------------------------------------
>> #
>> # -DHPL_COPY_L          force the copy of the panel L before bcast;
>> # -DHPL_CALL_CBLAS      call the cblas interface;
>> # -DHPL_CALL_VSIPL      call the vsip library;
>> # -DHPL_DETAILED_TIMING enable detailed timers;
>> #
>> # By default HPL will:
>> #    *) not copy L before broadcast,
>> #    *) call the BLAS Fortran 77 interface,
>> #    *) not display detailed timing information.
>> #
>> HPL_OPTS = -DHPL_CALL_CBLAS
>> #
>> # ----------------------------------------------------------------------
>> #
>> HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
>> #
>> # ----------------------------------------------------------------------
>> # - Compilers / linkers - Optimization flags ----------------------------
>> # ----------------------------------------------------------------------
>> #
>> CC       = /usr/bin/mpicc
>> CCNOOPT  = $(HPL_DEFS)
>> CCFLAGS  = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
>> #CCFLAGS = $(HPL_DEFS)
>> #
>> # On some platforms, it is necessary to use the Fortran linker to find
>> # the Fortran internals used in the BLAS library.
>> #
>> LINKER    = /usr/bin/mpif90
>> LINKFLAGS = $(CCFLAGS)
>> #
>> ARCHIVER = ar
>> ARFLAGS  = r
>> RANLIB   = echo
>> #
>> # ----------------------------------------------------------------------
>>
>> Would you mind helping me figure this problem out?
>>
>> Regards,
>> Reza
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users