Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: libevent update
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-03-18 19:17:20


I found another problem with the libevent branch.

If I set "-mca btl tcp,self" on the command line then I get a segfault
when sending messages > 16 KB. I can try to make a smaller reproducer,
but for now you can reproduce it with the "progress" or "simple" tests
in ompi-tests:
   https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness

To build:
   shell$ make
To run with failure:
   shell$ mpirun -np 2 -mca btl tcp,self progress -s 16 -v 1
To run without failure:
   shell$ mpirun -np 2 -mca btl tcp,self progress -s 15 -v 1

This program will display the message "Checkpoint at any time...". If
you send mpirun SIGUSR2 it will progress to the next stage of the
test. The failure, though, occurs during the first message exchange,
before any of this comes into play.

I was using Odin. If I do not specify the BTLs, the test passes as
normal.

The backtrace is below:
------------------------------------------
...
Core was generated by `progress -s 16 -v 1'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
267 bml_btl->btl_free( bml_btl->btl, des );
(gdb) bt
#0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
#1 0x0000002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0, ep=0x0, des=0x559700, status=0) at pml_ob1_recvreq.c:190
#2 0x0000002a97930069 in mca_pml_ob1_recv_frag_callback (btl=0x5598c0, tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at pml_ob1_recvfrag.c:149
#3 0x0000002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10, flags=2, user=0x5a5df0) at btl_tcp_endpoint.c:696
#4 0x0000002a95a0ab93 in event_process_active (base=0x508c80) at event.c:591
#5 0x0000002a95a0af59 in opal_event_base_loop (base=0x508c80, flags=2) at event.c:763
#6 0x0000002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
#7 0x0000002a959fadf8 in opal_progress () at runtime/opal_progress.c:169
#8 0x0000002a9792caae in opal_condition_wait (c=0x2a9587d940, m=0x2a9587d9c0) at ../../../../opal/threads/condition.h:93
#9 0x0000002a9792c9dd in ompi_request_wait_completion (req=0x5a5380) at ../../../../ompi/request/request.h:381
#10 0x0000002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70, count=16384, datatype=0x503770, src=1, tag=1001, comm=0x5039a0, status=0x0) at pml_ob1_irecv.c:104
#11 0x0000002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384, type=0x503770, source=1, tag=1001, comm=0x5039a0, status=0x0) at precv.c:75
#12 0x000000000040211f in exchange_stage1 (ckpt_num=1) at progress.c:414
#13 0x0000000000401295 in main (argc=5, argv=0x7fbfffe668) at progress.c:131
(gdb) p bml_btl
$1 = (mca_bml_base_btl_t *) 0x736275705f61636d
(gdb) p *bml_btl
Cannot access memory at address 0x736275705f61636d
------------------------------------------

-- Josh

On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:

> WHAT: Bring new version of libevent to the trunk.
>
> WHY: Newer version, slightly better performance (lower overheads /
> lighter weight), properly integrate the use of epoll and other
> scalable fd monitoring mechanisms.
>
> WHERE: 98% of the changes are in opal/event; there are a few changes
> to configury and one change to the orted.
>
> TIMEOUT: COB, Friday, 21 March 2008
>
> DESCRIPTION:
>
> George/UTK has done the bulk of the work to integrate a new version of
> libevent on the following tmp branch:
>
> https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge
>
> ** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! **
>
> Cisco ran MTT on this branch on Friday and everything checked out
> (i.e., no more failures than on the trunk). We just made a few more
> minor changes today and I'm running MTT again now, but I'm not
> expecting any new failures (MTT will take several hours). We would
> like to bring the new libevent in over this upcoming weekend, but
> would very much appreciate if others could test on their platforms
> (Cisco tests mainly 64 bit RHEL4U4). This new libevent *should* be a
> fairly side-effect free change, but it is possible that since we're
> now using epoll and other scalable fd monitoring tools, we'll run into
> some unanticipated issues on some platforms.
>
> Here's a consolidated diff if you want to see the changes:
>
> https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842
>
> Thanks.
>
> --
> Jeff Squyres
> Cisco Systems