Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-10-25 08:51:07


I'm assuming that this is a production version of NetPIPE (NP), right?
(i.e., not a development version)

Can you run the MPI processes through valgrind to see where the error
really occurs? The corefile only shows where the process finally
crashed, not the actual cause.
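
For reference, a minimal sketch of what such a run could look like. It
reuses the mpirun arguments from the quoted message below and places
valgrind in front of the benchmark so that each rank runs under it. The
./NPmpi path, the log-file name, and the RUN switch are assumptions made
for illustration, and the %p (process ID) expansion in --log-file is a
feature of current valgrind releases, so adjust it for older versions.

```shell
#!/bin/sh
# Sketch: run every MPI rank under valgrind so that the first invalid
# memory access is reported, not just the final crash.
# Assumptions (not from the original thread): NPmpi sits in the current
# directory and valgrind is installed on the node(s) running the job.
NP_BINARY="./NPmpi"

# One log file per rank; %p expands to the PID in current valgrind.
VG_ARGS="--log-file=valgrind.%p.log"

CMD="mpirun -np 2 -mca btl_base_exclude openib valgrind $VG_ARGS $NP_BINARY"

# Always print the command; execute it only when RUN=1 is set.
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
    $CMD
fi
```

Invoked with RUN=1, this should leave one valgrind.<pid>.log file per
rank; the first "Invalid read", "Invalid write", or "Invalid free"
entry in the log of the crashing rank is usually the place to start.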

Troy Benjegerdes wrote:
> On Mon, Oct 24, 2005 at 06:03:02PM -0500, Troy Benjegerdes wrote:
>
>>troy_at_opteron1:/usr/src/netpipe3-dev$ mpirun -np 2 -mca btl_base_exclude openib NPmpi
>>1: opteron1
>>0: opteron1
>>mpirun noticed that job rank 1 with PID 352 on node "localhost" exited
>>on signal 11.
>>1 process killed (possibly by Open MPI)
>>
>>This is debian-amd64 (from
>>deb http://mirror.espri.arizona.edu/debian-amd64/debian/ etch main )
>>
>>On Mon, Oct 24, 2005 at 10:36:29AM -0500, Brian Barrett wrote:
>>
>>>That's a really weird backtrace - it seems to indicate that the
>>>datatype engine is improperly calling free(). Can you try running
>>>without openib (add "-mca btl_base_exclude openib" to the mpirun
>>>arguments) and see if the problem goes away? Also, what platform was
>>>this on?
>
>
> Okay.. here's another backtrace, this time with no openib.
>
> 0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
> (gdb) bt
> #0 0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
> #1 0x00002aaaaaecb016 in opal_mem_free_free_hook ()
> from /usr/local/lib/libopal.so.0
> #2 0x00002aaaaac0c663 in ompi_convertor_cleanup ()
> from /usr/local/lib/libmpi.so.0
> #3 0x00002aaaaeb41dbe in mca_pml_ob1_match_completion_cache ()
> from /usr/local/lib/openmpi/mca_pml_ob1.so
> #4 0x00002aaaaf179c7b in mca_btl_sm_component_progress ()
> from /usr/local/lib/openmpi/mca_btl_sm.so
> #5 0x00002aaaaee5eefe in mca_bml_r2_progress ()
> from /usr/local/lib/openmpi/mca_bml_r2.so
> #6 0x00002aaaaeb3dd4e in mca_pml_ob1_progress ()
> from /usr/local/lib/openmpi/mca_pml_ob1.so
> #7 0x00002aaaaaeb5c4a in opal_progress ()
> from /usr/local/lib/libopal.so.0
> #8 0x00002aaaaeb3c265 in mca_pml_ob1_recv ()
> from /usr/local/lib/openmpi/mca_pml_ob1.so
> #9 0x00002aaaaf6a0936 in mca_coll_basic_barrier_intra_lin ()
> from /usr/local/lib/openmpi/mca_coll_basic.so
> #10 0x00002aaaaac1f3b8 in PMPI_Barrier ()
> from /usr/local/lib/libmpi.so.0
> #11 0x00000000004030a2 in Sync (p=0x10053d900) at src/mpi.c:89
> #12 0x0000000000401f83 in main (argc=2, argv=0x7fffffe30ae8)
> at src/netpipe.c:463

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/