Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Segfault in 1.3 branch
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-07-15 17:27:54


To be clear -- this looks like a different issue than what Pasha was
reporting.

On Jul 15, 2008, at 8:55 AM, Rolf vandeVaart wrote:

>
> Lenny, I opened a ticket for something that looks the same as this.
> Maybe you can add your details to it.
>
> https://svn.open-mpi.org/trac/ompi/ticket/1386
>
> Rolf
>
> Lenny Verkhovsky wrote:
>>
>> I guess it should be here, sorry.
>>
>> /home/USERS/lenny/OMPI_ORTE_18850/bin/mpirun -np 2 -H
>> witch2,witch3 ./IMB-MPI1_18850 PingPong
>> #---------------------------------------------------
>> # Intel (R) MPI Benchmark Suite V3.0v modified by Voltaire, MPI-1
>> part
>> #---------------------------------------------------
>> # Date : Tue Jul 15 15:11:30 2008
>> # Machine : x86_64
>> # System : Linux
>> # Release : 2.6.16.46-0.12-smp
>> # Version : #1 SMP Thu May 17 14:00:09 UTC 2007
>> # MPI Version : 2.0
>> # MPI Thread Environment: MPI_THREAD_SINGLE
>>
>> #
>> # Minimum message length in bytes: 0
>> # Maximum message length in bytes: 67108864
>> #
>> # MPI_Datatype : MPI_BYTE
>> # MPI_Datatype for reductions : MPI_FLOAT
>> # MPI_Op : MPI_SUM
>> #
>> #
>>
>> # List of Benchmarks to run:
>>
>> # PingPong
>> [witch3:32461] *** Process received signal ***
>> [witch3:32461] Signal: Segmentation fault (11)
>> [witch3:32461] Signal code: Address not mapped (1)
>> [witch3:32461] Failing at address: 0x20
>> [witch3:32461] [ 0] /lib64/libpthread.so.0 [0x2b514fcedc10]
>> [witch3:32461] [ 1] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/
>> mca_pml_ob1.so [0x2b51510b416a]
>> [witch3:32461] [ 2] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/
>> mca_pml_ob1.so [0x2b51510b4661]
>> [witch3:32461] [ 3] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/
>> mca_pml_ob1.so [0x2b51510b180e]
>> [witch3:32461] [ 4] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/
>> mca_btl_openib.so [0x2b5151811c22]
>> [witch3:32461] [ 5] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/
>> mca_btl_openib.so [0x2b51518132e9]
>> [witch3:32461] [ 6] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/
>> mca_bml_r2.so [0x2b51512c412f]
>> [witch3:32461] [ 7] /home/USERS/lenny/OMPI_ORTE_18850/lib/libopen-
>> pal.so.0(opal_progress+0x5a) [0x2b514f71268a]
>> [witch3:32461] [ 8] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/
>> mca_pml_ob1.so [0x2b51510af0f5]
>> [witch3:32461] [ 9] /home/USERS/lenny/OMPI_ORTE_18850/lib/libmpi.so.
>> 0(PMPI_Recv+0x13b) [0x2b514f47941b]
>> [witch3:32461] [10] ./IMB-MPI1_18850(IMB_pingpong+0x1a1) [0x4073cd]
>> [witch3:32461] [11] ./IMB-MPI1_18850(IMB_warm_up+0x2d) [0x405e49]
>> [witch3:32461] [12] ./IMB-MPI1_18850(main+0x394) [0x4034d4]
>> [witch3:32461] [13] /lib64/libc.so.6(__libc_start_main+0xf4)
>> [0x2b514fe14154]
>> [witch3:32461] [14] ./IMB-MPI1_18850 [0x4030a9]
>> [witch3:32461] *** End of error message ***
>> mpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> witch2
>> witch3
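
A note on the trace above: "Signal code: Address not mapped" with a failing address as small as 0x20 usually means a field of a NULL struct pointer was dereferenced, the fault address being roughly the field's offset within the struct. The following minimal C sketch is purely illustrative (the struct and field names are hypothetical, not the actual ob1 fragment types):

    /* Illustrative only: dereferencing a field of a NULL struct
     * pointer faults at that field's offset, e.g. 0x20. */
    #include <stddef.h>
    #include <stdio.h>

    struct frag {                /* hypothetical fragment type */
        char header[32];         /* 32 bytes == 0x20 */
        int  length;             /* sits at offset 0x20 */
    };

    int main(void) {
        struct frag *f = NULL;
        /* The read below is commented out because it would crash;
         * uncommenting it faults at address 0x20 on typical systems,
         * matching "Failing at address: 0x20" in the trace. */
        /* int n = f->length; */
        printf("f = %p, offsetof(length) = %#zx\n",
               (void *) f, offsetof(struct frag, length));
        return 0;
    }
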
>>
>>
>> On 7/15/08, Pavel Shamis (Pasha) <pasha_at_[hidden]> wrote:
>>
>>
>> It looks like a new issue to me, Pasha. Possibly a side
>> consequence of the
>> IOF change made by Jeff and I the other day. From what I can
>> see, it looks
>> like your app was a simple "hello" - correct?
>>
>> Yep, it is a simple hello application.
>>
>> If you look at the error, the problem occurs when mpirun is
>> trying to route
>> a message. Since the app is clearly running at this time, the
>> problem is
>> probably in the IOF. The error message shows that mpirun is
>> attempting to
>> route a message to a jobid that doesn't exist. We have a test
>> in the RML
>> that forces an "abort" if that occurs.
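
As a purely hypothetical illustration of the kind of guard described here (none of these names are the real Open MPI RML API), a routing check that aborts on an unknown jobid rather than dereferencing stale job state might look like:

    /* Hypothetical sketch of a routing-layer guard. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef uint32_t jobid_t;

    static int job_is_known(jobid_t jobid) {
        (void) jobid;  /* a real lookup would search the job table */
        return 0;      /* pretend the jobid was not found */
    }

    static void route_message(jobid_t dst) {
        if (!job_is_known(dst)) {
            /* Failing loudly surfaces the race (or the memory
             * corruption) instead of silently misrouting. */
            fprintf(stderr, "RML: no route to jobid %u - aborting\n",
                    (unsigned) dst);
            abort();
        }
        /* ... normal delivery path ... */
    }

    int main(void) {
        route_message(42);  /* triggers the abort in this sketch */
        return 0;
    }
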
>>
>> I would guess that there is either a race condition or memory
>> corruption
>> occurring somewhere, but I have no idea where.
>>
>> This may be the "new hole in the dyke" I cautioned about in
>> earlier notes
>> regarding the IOF... :-)
>>
>> Still, given that this hits rarely, it probably is a more
>> acceptable bug to
>> leave in the code than the one we just fixed (duplicated
>> stdin)...
>>
>> It is not such a rare issue: there were 19 failures in my MTT run
>> (http://www.open-mpi.org/mtt/index.php?do_redir=765).
>>
>> Pasha
>>
>> Ralph
>>
>>
>>
>> On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" <pasha_at_[hidden]>
>> wrote:
>>
>>
>> Please see http://www.open-mpi.org/mtt/index.php?do_redir=764
>>
>> The error is not consistent; it takes a lot of iterations
>> to reproduce. In my MTT testing I have seen it a few times.
>>
>> Is it a known issue?
>>
>> Regards,
>> Pasha
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems