Open MPI Development Mailing List Archives

From: Torsten Hoefler (torsten.hoefler_at_[hidden])
Date: 2005-08-04 04:06:37

On Tue, Aug 02, 2005 at 02:40:21PM -0500, Brian Barrett wrote:
> The tree now compiles with the --enable-mpi-threads problem. There
> is a bug in the event library that will cause deadlocks in orterun,
> so the tree isn't exactly useful right now. Tim Woodall is going to
> look into the problem.
ok - thanks!

A new problem arose after compiling and running my first test program.
It simply spawns a separate thread on each rank and sends/receives 1-byte
(MPI_BYTE) messages in this thread. There seems to be a race condition:
sometimes all messages are received correctly, and sometimes all
messages fail and the receiving rank eats up a lot of memory (>600 MB)
and segfaults.

The backtrace is:
#0 0x0015d828 in ompi_convertor_unpack (pConv=0x83569e0, iov=0x479e798,
    out_size=0x479e7bc, max_data=0x479e7b8, freeAfter=0x479e7b4)
    at convertor.c:104
#1 0x00f76af4 in mca_ptl_tcp_recv_frag_progress (frag=0x8356980)
    at ptl_tcp_recvfrag.h:166
#2 0x00f76124 in mca_ptl_tcp_matched (ptl=0x83321a8, frag=0x8356980)
    at ptl_tcp.c:302
#3 0x0090d314 in mca_pml_teg_recv_frag_match (ptl=0x8320948, frag=0x8356980,
    header=0x8356ab4) at pml_teg_recvfrag.c:82
#4 0x00f7bbdc in mca_ptl_tcp_recv_frag_handler (frag=0x8356a94, sd=12)
    at ptl_tcp_recvfrag.c:107
#5 0x00f7a20f in mca_ptl_tcp_peer_recv_handler (sd=12, flags=2,
    user=0x836b628) at ptl_tcp_peer.c:606
#6 0x002a8ff8 in opal_event_process_active () at event.c:453
#7 0x002a92e3 in opal_event_loop (flags=2) at event.c:543
#8 0x002b733b in opal_progress () at opal_progress.c:211
#9 0x00909295 in opal_condition_wait (c=0x23bc80, m=0x23bce0)
    at condition.h:66
#10 0x00908a93 in mca_pml_teg_recv (addr=0x479ea94, count=1,
    datatype=0x804a4a8, src=-1, tag=100002, comm=0x804a5f0, status=0x8380108)
    at pml_teg_irecv.c:100
#11 0x001bc50f in PMPI_Recv (buf=0x479ea94, count=1, type=0x804a4a8,
    source=-1, tag=100002, comm=0x804a5f0, status=0x8380108) at precv.c:66
#12 0x08048f66 in MPI_Barrier_start_worker_thread (param=0x83809f0)
    at nbbarr.c:76
#13 0x0072fdec in pthread_create@@GLIBC_2.1 () from /lib/tls/
#14 0x0082519a in iswctype_l () from /lib/tls/

The zipped corefile can be found at:

Any ideas, or should I try to debug it myself?


 bash$ :(){ :|:&};: ----- pgp: -----
An optimist believes we live in the best of all possible worlds.  
A pessimist is sure of it!