
Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2006-02-21 16:11:36


The stack trace is reported by one of our internal tools. Entries
[0] and [1] belong to the signal handler itself (the Open MPI frame
and the libpthread frame in your output), so ignore those two when
reading the stack.
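
For illustration, here is a minimal sketch (not Open MPI's actual
handler code) of how such a trace gets produced: a SIGSEGV handler
captures the stack with backtrace(), so the handler's own frames end
up at entries [0] and [1] of the printed list.

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Illustrative segfault handler; the real Open MPI handler differs
 * in detail. */
static void segv_handler(int sig)
{
    void *frames[32];
    int n;

    (void)sig;  /* unused */
    n = backtrace(frames, 32);
    /* frames[0] is this handler and frames[1] is the libc signal
     * trampoline; the faulting application code starts at frames[2]. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(1);
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *)0 = 0;   /* force a segfault for demonstration */
    return 0;
}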

frag_recv is the place where we detect that a remote node has
failed. Once one node segfaults, the others abort the MPI job: by
default we follow the MPI specification, which treats errors as
fatal.
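
To make that concrete: the default MPI error handler is
MPI_ERRORS_ARE_FATAL, which is why one failing rank takes down the
whole job. Here is a minimal sketch of opting into MPI_ERRORS_RETURN
to observe error codes instead; the invalid destination rank is
deliberate, and relies on runtime parameter checking (which your
ompi_info output below shows is enabled).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so that MPI
     * calls return error codes instead of aborting every rank. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int buf = 0;
    /* Deliberately invalid destination rank to provoke an error. */
    int rc = MPI_Send(&buf, 1, MPI_INT, 999999, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}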

It's very difficult to say what's wrong from so little information.
If you can create a non-NDA test case that fails for you, it will
make our lives easier.
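
Something along these lines would do: a stripped-down ping-pong
stress loop that exercises the same TCP transport path with no
application logic in it. The message size and iteration count are
arbitrary placeholders; tune them until the crash reproduces.

#include <mpi.h>
#include <stdlib.h>

#define MSG_WORDS 4096          /* placeholder message size */
#define ITERATIONS 1000000L     /* increase until the crash shows up */

int main(int argc, char **argv)
{
    int rank, size;
    int *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    buf = calloc(MSG_WORDS, sizeof(int));

    /* Pair ranks 0-1, 2-3, ... and ping-pong within each pair; an
     * odd rank count leaves the last rank idle. */
    int peer = rank ^ 1;
    if (peer < size) {
        for (long i = 0; i < ITERATIONS; ++i) {
            if (rank < peer) {
                MPI_Send(buf, MSG_WORDS, MPI_INT, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_WORDS, MPI_INT, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG_WORDS, MPI_INT, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_WORDS, MPI_INT, peer, 0, MPI_COMM_WORLD);
            }
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}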

   Thanks,
     george.

On Feb 21, 2006, at 3:12 PM, Luke Cyca wrote:

> Hi,
>
> I'm experiencing seemingly random crashes when running my program
> with Open MPI version 1.0.2a7r9094. My simulation runs fine through
> many iterations, sometimes for several hours, and then quits with
> the following output.
>
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0x37e3200
>> [0] func:/usr/lib/libopal.so.0 [0x2aaaac0ab2da]
>> [1] func:/lib/libpthread.so.0 [0x2aaaaacd04d0]
>> [2] <My program's stack trace removed for NDA paranoia>
>> *** End of error message ***
>> [maltose:09314] mca_btl_tcp_frag_send: writev failed with errno=104
> Sometimes the error gets reported by frag_recv instead:
>> [glucose:09277] mca_btl_tcp_frag_recv: writev failed with errno=104
>
> Error 104 corresponds to ECONNRESET ("Connection reset by peer").
> Otherwise, though, my network seems to be operating well: running
> `ifconfig eth0` on any of the nodes reports no errors or dropped
> packets.
>
> I had the same behavior when using the 1.0.1r8453 release.
>
> I'm unsure how to troubleshoot this further. Any help or suggestions
> would be greatly appreciated.
>
> Here's my ompi_info output:
>> Open MPI: 1.0.2a7r9094
>> Open MPI SVN revision: r9094
>> Open RTE: 1.0.2a7r9094
>> Open RTE SVN revision: r9094
>> OPAL: 1.0.2a7r9094
>> OPAL SVN revision: r9094
>> Prefix: /usr/local
>> Configured architecture: x86_64-unknown-linux-gnu
>> Configured by: zymo
>> Configured on: Mon Feb 20 15:56:27 PST 2006
>> Configure host: idose
>> Built by: zymo
>> Built on: Mon Feb 20 16:14:53 PST 2006
>> Built host: idose
>> C bindings: yes
>> C++ bindings: yes
>> Fortran77 bindings: yes (all)
>> Fortran90 bindings: no
>> C compiler: gcc
>> C compiler absolute: /usr/bin/gcc
>> C++ compiler: g++
>> C++ compiler absolute: /usr/bin/g++
>> Fortran77 compiler: g77
>> Fortran77 compiler abs: /usr/bin/g77
>> Fortran90 compiler: none
>> Fortran90 compiler abs: none
>> C profiling: yes
>> C++ profiling: yes
>> Fortran77 profiling: yes
>> Fortran90 profiling: no
>> C++ exceptions: no
>> Thread support: posix (mpi: no, progress: no)
>> Internal debug support: no
>> MPI parameter check: runtime
>> Memory profiling support: no
>> Memory debugging support: no
>> libltdl support: 1
>> MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA timer: linux (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA coll: self (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA io: romio (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA pml: teg (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ptl: self (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA btl: self (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA btl: sm (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ns: replica (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0.1)
>> MCA pls: fork (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA sds: env (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA sds: seed (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0.2)
>> MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0.2)
>
> ____________
> Luke Cyca
> (604) 678-1388 ext. 32
> luke_at_[hidden]
> www.zymeworks.com
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

"Half of what I say is meaningless; but I say it so that the other
half may reach you"
                                   Kahlil Gibran