
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-03-19 07:32:27


If you're looking for true fault tolerance, OMPI doesn't have it
yet. An audit of the code base to ensure that errors are
continuable is planned, but it has not yet been scheduled on the
roadmap.

The FT-MPI guys can comment on their timetable for bringing that
technology in...
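
In the meantime, the closest you can get with stock MPI semantics is
to switch your communicator's error handler from the default
MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN and check the returned
error codes yourself. A minimal f77-style sketch of the pattern
(note: this only makes errors returnable -- it does not, by itself,
let today's OMPI survive a dead peer):

      program errcheck
      implicit none
      include 'mpif.h'
      integer ierr, rc, taskid
      call mpi_init(ierr)
c     ask MPI to return error codes instead of aborting the job
      call mpi_errhandler_set(mpi_comm_world, mpi_errors_return, ierr)
      call mpi_comm_rank(mpi_comm_world, taskid, ierr)
c     in Fortran the error code comes back in the trailing argument
      call mpi_barrier(mpi_comm_world, rc)
      if (rc .ne. mpi_success) then
         write (*,*) 'task', taskid, ': barrier failed, rc =', rc
      end if
      call mpi_finalize(ierr)
      end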

On Mar 18, 2007, at 9:47 PM, Mohammad Huwaidi wrote:

> Thanks Jeff.
>
> The kind of faults I am trying to trap are application and node
> failures. I literally kill the application on another node in the
> hope of trapping the failure and reacting accordingly. This is
> similar to FT-MPI shrinking the communicator size, etc.
>
> If you can suggest a different implementation that will allow me
> to trap such failures, please let me know.
>
> Regards,
> Mohammad Huwaidi
>
> users-request_at_[hidden] wrote:
>> Today's Topics:
>> 1. open-mpi 1.2 build failure under Mac OS X 10.3.9
>> (Marius Schamschula)
>> 2. Re: OpenMPI 1.2 bug: segmentation violation in mpi_pack
>> (Jeff Squyres)
>> 3. Re: Fault Tolerance (Jeff Squyres)
>> 4. Re: Signal 13 (Ralph Castain)
>> ----------------------------------------------------------------------
>> Message: 1
>> Date: Fri, 16 Mar 2007 18:42:22 -0500
>> From: Marius Schamschula <marius_at_[hidden]>
>> Subject: [OMPI users] open-mpi 1.2 build failure under Mac OS X
>> 10.3.9
>> To: users_at_[hidden]
>> Message-ID: <82367DB0-EBC6-4438-BBC2-D7896318633E_at_[hidden]>
>> Content-Type: text/plain; charset="us-ascii"
>> Hi all,
>> I was building open-mpi 1.2 on my G4 running Mac OS X 10.3.9 and
>> had a build failure with the following:
>> depbase=`echo runtime/ompi_mpi_preconnect.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \
>> if /bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF "$depbase.Tpo" -c -o runtime/ompi_mpi_preconnect.lo runtime/ompi_mpi_preconnect.c; \
>> then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; exit 1; fi
>> libtool: compile: gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF runtime/.deps/ompi_mpi_preconnect.Tpo -c runtime/ompi_mpi_preconnect.c -fno-common -DPIC -o runtime/.libs/ompi_mpi_preconnect.o
>> runtime/ompi_mpi_preconnect.c: In function
>> `ompi_init_do_oob_preconnect':
>> runtime/ompi_mpi_preconnect.c:74: error: storage size of `msg'
>> isn't known
>> make[2]: *** [runtime/ompi_mpi_preconnect.lo] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all-recursive] Error 1
>> $ gcc -v
>> Reading specs from /usr/libexec/gcc/darwin/ppc/3.3/specs
>> Thread model: posix
>> gcc version 3.3 20030304 (Apple Computer, Inc. build 1495)
>> $ g77 -v
>> Reading specs from /usr/local/lib/gcc/powerpc-apple-darwin7.3.0/3.5.0/specs
>> Configured with: ../gcc/configure --enable-threads=posix --enable-languages=f77
>> Thread model: posix
>> gcc version 3.5.0 20040429 (experimental)
>> (g77 from hpc.sf.net)
>> Note: I had no such problem under Mac OS X 10.4.9 with my ppc and
>> x86 builds. However, I did notice that the configure script did
>> not detect g95 from g95.org correctly:
>> *** Fortran 90/95 compiler
>> checking for gfortran... no
>> checking for f95... no
>> checking for fort... no
>> checking for xlf95... no
>> checking for ifort... no
>> checking for ifc... no
>> checking for efc... no
>> checking for pgf95... no
>> checking for lf95... no
>> checking for f90... no
>> checking for xlf90... no
>> checking for pgf90... no
>> checking for epcf90... no
>> checking whether we are using the GNU Fortran compiler... no
>> configure --help doesn't give any hint about specifying F95.
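>> (For what it's worth: configure should honor the standard autoconf
>> FC variable for the Fortran 90/95 compiler, so -- assuming g95 is
>> in your PATH -- something like "./configure FC=g95" may do the
>> trick, though that is untested here.)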
>> TIA,
>> Marius
>> --
>> Marius Schamschula, Alabama A & M University, Department of Physics
>> The Center for Hydrology Soil Climatology and Remote Sensing
>> http://optics.physics.aamu.edu/ - http://www.physics.aamu.edu/
>> http://wx.aamu.edu/ - http://www.aamu.edu/hscars/
>> ------------------------------
>> Message: 2
>> Date: Fri, 16 Mar 2007 19:46:39 -0400
>> From: Jeff Squyres <jsquyres_at_[hidden]>
>> Subject: Re: [OMPI users] OpenMPI 1.2 bug: segmentation violation in
>> mpi_pack
>> To: Open MPI Users <users_at_[hidden]>
>> Message-ID: <045DABAC-1369-4E45-8E0C-FD9FBA13C95F_at_[hidden]>
>> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
>> The problem with both the f77 and f90 programs is that you forgot
>> to put "ierr" as the last argument to MPI_PACK. This causes a
>> segv because neither of them is a correct MPI program.
>> But it's always good to hear that we can deliver a smaller
>> corefile in v1.2! :-)
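>> For reference, the first pack call with "ierr" added as the final
>> argument (the other two follow the same pattern):
>>
>>       call mpi_pack (cmd, 1, MPI_INTEGER,
>>      *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD, ierr)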
>> On Mar 16, 2007, at 7:25 PM, Erik Deumens wrote:
>>> I have a small program in F77 that crashes with a SEGV and
>>> dumps a 130 MB core file. It is true that the crash is much
>>> cleaner in OpenMPI 1.2; nice improvement! The core file is
>>> 500 MB with OpenMPI 1.1.
>>>
>>> I am running on CentOS 4.4 with the latest patches.
>>>
>>> mpif77 -g -o bug bug.f
>>> mpirun -np 2 ./bug
>>>
>>> I also have a bug.f90 (which I made first) and it crashes
>>> too with the Intel ifort compiler 9.1.039.
>>>
>>> --
>>> Dr. Erik Deumens
>>> Interim Director
>>> Quantum Theory Project
>>> New Physics Building 2334 deumens_at_[hidden]
>>> University of Florida http://www.qtp.ufl.edu/~deumens
>>> Gainesville, Florida 32611-8435 (352)392-6980
>>>
>>>       program mainf
>>> c     mpif77 -g -o bug bug.f
>>> c     mpirun -np 2 ./bug
>>>       implicit none
>>>       include 'mpif.h'
>>>       character*80 inpfile
>>>       integer l
>>>       integer i
>>>       integer stat
>>>       integer cmdbuf(4)
>>>       integer lcmdbuf
>>>       integer ierr
>>>       integer ntasks
>>>       integer taskid
>>>       integer bufpos
>>>       integer cmd
>>>       integer ldata
>>>       character*(mpi_max_processor_name) hostnm
>>>       integer iuinp
>>>       integer iuout
>>>       integer lnam
>>>       real*8 bcaststart
>>>       iuinp = 5
>>>       iuout = 6
>>>       lcmdbuf = 16
>>>       i = 0
>>>       call mpi_init(ierr)
>>>       call mpi_comm_size (mpi_comm_world, ntasks, ierr)
>>>       call mpi_comm_rank (mpi_comm_world, taskid, ierr)
>>>       hostnm = ' '
>>>       call mpi_get_processor_name (hostnm, lnam, ierr)
>>>       write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
>>>       if (taskid == 0) then
>>>          inpfile = ' '
>>>          do
>>>             write (iuout,*) 'Enter .inp filename:'
>>>             read (iuinp,*) inpfile
>>>             if (inpfile /= ' ') exit
>>>          end do
>>>          l = len_trim(inpfile)
>>>          write (iuout,*) 'task',taskid,inpfile(1:l)
>>>          bufpos = 0
>>>          cmd = 1099
>>>          ldata = 7
>>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>          write (iuout,*) 'task',taskid,cmd,lcmdbuf
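>>> c        note: ierr is missing as the last argument of the three
>>> c        mpi_pack calls below -- the bug identified in the reply above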
>>>          call mpi_pack (cmd, 1, MPI_INTEGER,
>>>      *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>          write (iuout,*) 'task',taskid,ldata,lcmdbuf
>>>          call mpi_pack (ldata, 1, MPI_INTEGER,
>>>      *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>>          bcaststart = mpi_wtime()
>>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>          write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
>>>          call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION,
>>>      *        cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>>          write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>       end if
>>>       call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED,
>>>      *     0, MPI_COMM_WORLD, ierr)
>>>       call mpi_finalize(ierr)
>>>       stop
>>>       end program mainf
>>>
>>> program mainf
>>>   ! ifort -g -I /share/local/lib/ompi/include -o bug bug.f90
>>>   !       -L /share/local/lib/ompi/lib -lmpi_f77 -lmpi
>>>   ! mpirun -np 2 ./bug
>>>   implicit none
>>>   include 'mpif.h'
>>>   character(len=80) :: inpfile
>>>   character(len=1), dimension(80) :: cinpfile
>>>   integer :: l
>>>   integer :: i
>>>   integer :: stat
>>>   integer, dimension(4) :: cmdbuf
>>>   integer :: lcmdbuf
>>>   integer :: ierr
>>>   integer :: ntasks
>>>   integer :: taskid
>>>   integer :: bufpos
>>>   integer :: cmd
>>>   integer :: ldata
>>>   character(len=mpi_max_processor_name) :: hostnm
>>>   integer :: iuinp = 5
>>>   integer :: iuout = 6
>>>   integer :: lnam
>>>   real(8) :: bcaststart
>>>   lcmdbuf = 16
>>>   i = 0
>>>   call mpi_init(ierr)
>>>   call mpi_comm_size (mpi_comm_world, ntasks, ierr)
>>>   call mpi_comm_rank (mpi_comm_world, taskid, ierr)
>>>   hostnm = ' '
>>>   call mpi_get_processor_name (hostnm, lnam, ierr)
>>>   write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
>>>   if (taskid == 0) then
>>>     inpfile = ' '
>>>     do
>>>       write (iuout,*) 'Enter .inp filename:'
>>>       read (iuinp,*) inpfile
>>>       if (inpfile /= ' ') exit
>>>     end do
>>>     l = len_trim(inpfile)
>>>     do i=1,l
>>>       cinpfile(i) = inpfile(i:i)
>>>     end do
>>>     cinpfile(l+1) = char(0)
>>>     write (iuout,*) 'task',taskid,inpfile(1:l)
>>>     bufpos = 0
>>>     cmd = 1099
>>>     ldata = 7
>>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>     ! The next two lines exhibit the bug:
>>>     ! uncomment the first and the program works;
>>>     ! uncomment the second and the program dies in mpi_pack
>>>     ! and produces a 137 MB core file.
>>>     write (iuout,*) 'task',taskid,cmd,lcmdbuf
>>>     ! write (iuout,*) 'task',taskid,cmd
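>>>     ! note: ierr is also missing as the last argument of the
>>>     ! mpi_pack calls below (see the reply above)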
>>>     call mpi_pack (cmd, 1, MPI_INTEGER, &
>>>          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>     write (iuout,*) 'task',taskid,ldata,lcmdbuf
>>>     call mpi_pack (ldata, 1, MPI_INTEGER, &
>>>          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>>     bcaststart = mpi_wtime()
>>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>     write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
>>>     call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION, &
>>>          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
>>>     write (iuout,*) 'task',taskid,cmdbuf,bufpos
>>>   end if
>>>   call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED, &
>>>        0, MPI_COMM_WORLD, ierr)
>>>   call mpi_finalize(ierr)
>>>   stop
>>> end program mainf
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
>
> Regards,
> Mohammad Huwaidi
>
> We can't resolve problems by using the same kind of thinking we used
> when we created them.
> --Albert Einstein
> <mohammad.vcf>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems