Open MPI User's Mailing List Archives

From: Mohammad Huwaidi (mohammad_at_[hidden])
Date: 2007-03-18 21:47:55


Thanks Jeff.

The kinds of faults I am trying to trap are application and node
failures. I literally kill the application on another node in the
hope of trapping the failure and reacting to it accordingly. This is
similar to the way FT-MPI shrinks the communicator size, etc.
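
As a rough sketch of what I have been attempting (a toy free-form
Fortran program; the assumption, which is exactly what I am trying to
verify, is that replacing the default error handler with
MPI_ERRORS_RETURN makes a failed call hand back an error code instead
of aborting the whole job when the peer is killed):

program trap_fault
  ! ask for error codes instead of aborts, then check the code
  ! returned by a communication with the peer I am going to kill
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, nprocs, buf
  integer :: status(MPI_STATUS_SIZE)
  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
  ! hand errors back to the caller instead of aborting the job
  call mpi_comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)
  buf = rank
  ! I kill the process on the other node while this exchange is pending
  if (rank == 0) then
    call mpi_recv(buf, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, status, ierr)
  else if (rank == 1) then
    call mpi_send(buf, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
  end if
  if (ierr /= MPI_SUCCESS) then
    write (*,*) 'task', rank, ': trapped MPI error code', ierr
    ! this is where I would like to react, the way FT-MPI
    ! shrinks the communicator and carries on
  end if
  call mpi_finalize(ierr)
end program trap_fault

If the receive simply never returns when the peer dies, this approach
obviously cannot work, which is why I am asking about alternatives.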

If you can suggest a different implementation that would allow me to
trap these failures, please let me know.

Regards,
Mohammad Huwaidi

users-request_at_[hidden] wrote:
>
> Today's Topics:
>
> 1. open-mpi 1.2 build failure under Mac OS X 10.3.9
> (Marius Schamschula)
> 2. Re: OpenMPI 1.2 bug: segmentation violation in mpi_pack
> (Jeff Squyres)
> 3. Re: Fault Tolerance (Jeff Squyres)
> 4. Re: Signal 13 (Ralph Castain)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 16 Mar 2007 18:42:22 -0500
> From: Marius Schamschula <marius_at_[hidden]>
> Subject: [OMPI users] open-mpi 1.2 build failure under Mac OS X 10.3.9
> To: users_at_[hidden]
> Message-ID: <82367DB0-EBC6-4438-BBC2-D7896318633E_at_[hidden]>
> Content-Type: text/plain; charset="us-ascii"
>
> Hi all,
>
> I was building open-mpi 1.2 on my G4 running Mac OS X 10.3.9 and had
> a build failure with the following:
>
> depbase=`echo runtime/ompi_mpi_preconnect.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \
> if /bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF "$depbase.Tpo" -c -o runtime/ompi_mpi_preconnect.lo runtime/ompi_mpi_preconnect.c; \
> then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; exit 1; fi
> libtool: compile: gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF runtime/.deps/ompi_mpi_preconnect.Tpo -c runtime/ompi_mpi_preconnect.c -fno-common -DPIC -o runtime/.libs/ompi_mpi_preconnect.o
> runtime/ompi_mpi_preconnect.c: In function `ompi_init_do_oob_preconnect':
> runtime/ompi_mpi_preconnect.c:74: error: storage size of `msg' isn't known
> make[2]: *** [runtime/ompi_mpi_preconnect.lo] Error 1
> make[1]: *** [all-recursive] Error 1
> make: *** [all-recursive] Error 1
>
> $ gcc -v
> Reading specs from /usr/libexec/gcc/darwin/ppc/3.3/specs
> Thread model: posix
> gcc version 3.3 20030304 (Apple Computer, Inc. build 1495)
>
> $ g77 -v
> Reading specs from /usr/local/lib/gcc/powerpc-apple-darwin7.3.0/3.5.0/specs
> Configured with: ../gcc/configure --enable-threads=posix --enable-languages=f77
> Thread model: posix
> gcc version 3.5.0 20040429 (experimental)
>
> (g77 from hpc.sf.net)
>
>
> Note: I had no such problem under Mac OS X 10.4.9 with my ppc and x86
> builds. However, I did notice that the configure script did not
> detect g95 from g95.org correctly:
>
> *** Fortran 90/95 compiler
> checking for gfortran... no
> checking for f95... no
> checking for fort... no
> checking for xlf95... no
> checking for ifort... no
> checking for ifc... no
> checking for efc... no
> checking for pgf95... no
> checking for lf95... no
> checking for f90... no
> checking for xlf90... no
> checking for pgf90... no
> checking for epcf90... no
> checking whether we are using the GNU Fortran compiler... no
>
> configure --help doesn't give any hint about specifying F95.
>
> TIA,
>
> Marius
> --
> Marius Schamschula, Alabama A & M University, Department of Physics
>
> The Center for Hydrology Soil Climatology and Remote Sensing
> http://optics.physics.aamu.edu/ - http://www.physics.aamu.edu/
> http://wx.aamu.edu/ - http://www.aamu.edu/hscars/
>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 16 Mar 2007 19:46:39 -0400
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] OpenMPI 1.2 bug: segmentation violation in
> mpi_pack
> To: Open MPI Users <users_at_[hidden]>
> Message-ID: <045DABAC-1369-4E45-8E0C-FD9FBA13C95F_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>
> The problem with both the f77 and f90 programs is that you forgot to
> put "ierr" as the last argument to MPI_PACK. This causes a segv
> because neither of them is a correct MPI program.
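>
> In the Fortran bindings MPI_PACK takes a final ierror argument, so,
> for example, the first call would need to read something like:
>
>     call mpi_pack (cmd, 1, MPI_INTEGER, &
>          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD, ierr)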
>
> But it's always good to hear that we can deliver a smaller corefile
> in v1.2! :-)
>
>
> On Mar 16, 2007, at 7:25 PM, Erik Deumens wrote:
>
>> I have a small program in F77 that crashes with a SEGV and a
>> 130 MB core file. It is true that the crash is much cleaner
>> in OpenMPI 1.2; nice improvement! The core file is 500 MB with
>> OpenMPI 1.1.
>>
>> I am running on CentOS 4.4 with the latest patches.
>>
>> mpif77 -g -o bug bug.f
>> mpirun -np 2 ./bug
>>
>> I also have a bug.f90 (which I made first) and it crashes
>> too with the Intel ifort compiler 9.1.039.
>>
>> --
>> Dr. Erik Deumens
>> Interim Director
>> Quantum Theory Project
>> New Physics Building 2334 deumens_at_[hidden]
>> University of Florida http://www.qtp.ufl.edu/~deumens
>> Gainesville, Florida 32611-8435 (352)392-6980
>>
>>       program mainf
>> c     mpif77 -g -o bug bug.f
>> c     mpirun -np 2 ./bug
>>       implicit none
>>       include 'mpif.h'
>>       character*80 inpfile
>>       integer l
>>       integer i
>>       integer stat
>>       integer cmdbuf(4)
>>       integer lcmdbuf
>>       integer ierr
>>       integer ntasks
>>       integer taskid
>>       integer bufpos
>>       integer cmd
>>       integer ldata
>>       character*(mpi_max_processor_name) hostnm
>>       integer iuinp
>>       integer iuout
>>       integer lnam
>>       real*8 bcaststart
>>       iuinp = 5
>>       iuout = 6
>>       lcmdbuf = 16
>>       i = 0
>>       call mpi_init(ierr)
>>       call mpi_comm_size (mpi_comm_world, ntasks, ierr)
>>       call mpi_comm_rank (mpi_comm_world, taskid, ierr)
>>       hostnm = ' '
>>       call mpi_get_processor_name (hostnm, lnam, ierr)
>>       write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
>>       if (taskid == 0) then
>>          inpfile = ' '
>>          do
>>             write (iuout,*) 'Enter .inp filename:'
>>             read (iuinp,*) inpfile
>>             if (inpfile /= ' ') exit
>>          end do
>>          l = len_trim(inpfile)
>>          write (iuout,*) 'task',taskid,inpfile(1:l)
>>          bufpos = 0
>>          cmd = 1099
>>          ldata = 7
>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>          write (iuout,*) 'task',taskid,cmd,lcmdbuf
>>          call mpi_pack (cmd, 1, MPI_INTEGER,
>>      *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>          write (iuout,*) 'task',taskid,ldata,lcmdbuf
>>          call mpi_pack (ldata, 1, MPI_INTEGER,
>>      *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>          bcaststart = mpi_wtime()
>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>          write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
>>          call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION,
>>      *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>       end if
>>       call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED,
>>      *     0, MPI_COMM_WORLD, ierr)
>>       call mpi_finalize(ierr)
>>       stop
>>       end program mainf
>>
>> program mainf
>>   ! ifort -g -I /share/local/lib/ompi/include -o bug bug.f90
>>   !   -L /share/local/lib/ompi/lib -lmpi_f77 -lmpi
>>   ! mpirun -np 2 ./bug
>>   implicit none
>>   include 'mpif.h'
>>   character(len=80) :: inpfile
>>   character(len=1), dimension(80) :: cinpfile
>>   integer :: l
>>   integer :: i
>>   integer :: stat
>>   integer, dimension(4) :: cmdbuf
>>   integer :: lcmdbuf
>>   integer :: ierr
>>   integer :: ntasks
>>   integer :: taskid
>>   integer :: bufpos
>>   integer :: cmd
>>   integer :: ldata
>>   character(len=mpi_max_processor_name) :: hostnm
>>   integer :: iuinp = 5
>>   integer :: iuout = 6
>>   integer :: lnam
>>   real(8) :: bcaststart
>>   lcmdbuf = 16
>>   i = 0
>>   call mpi_init(ierr)
>>   call mpi_comm_size (mpi_comm_world, ntasks, ierr)
>>   call mpi_comm_rank (mpi_comm_world, taskid, ierr)
>>   hostnm = ' '
>>   call mpi_get_processor_name (hostnm, lnam, ierr)
>>   write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
>>   if (taskid == 0) then
>>     inpfile = ' '
>>     do
>>       write (iuout,*) 'Enter .inp filename:'
>>       read (iuinp,*) inpfile
>>       if (inpfile /= ' ') exit
>>     end do
>>     l = len_trim(inpfile)
>>     do i=1,l
>>       cinpfile(i) = inpfile(i:i)
>>     end do
>>     cinpfile(l+1) = char(0)
>>     write (iuout,*) 'task',taskid,inpfile(1:l)
>>     bufpos = 0
>>     cmd = 1099
>>     ldata = 7
>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>     ! The next two lines exhibit the bug
>>     ! Uncomment the first and the program works
>>     ! Uncomment the second and the program dies in mpi_pack
>>     ! and produces a 137 MB core file.
>>     write (iuout,*) 'task',taskid,cmd,lcmdbuf
>>     ! write (iuout,*) 'task',taskid,cmd
>>     call mpi_pack (cmd, 1, MPI_INTEGER, &
>>                    cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>     write (iuout,*) 'task',taskid,ldata,lcmdbuf
>>     call mpi_pack (ldata, 1, MPI_INTEGER, &
>>                    cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>     bcaststart = mpi_wtime()
>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>     write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
>>     call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION, &
>>                    cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>   end if
>>   call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED, &
>>                   0, MPI_COMM_WORLD, ierr)
>>   call mpi_finalize(ierr)
>>   stop
>> end program mainf
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

-- 
Regards,
Mohammad Huwaidi
We can't resolve problems by using the same kind of thinking we used
when we created them.
                                                 --Albert Einstein