Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: libevent update
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-03-18 19:29:35


This has been fixed in the trunk, but not yet merged into the branch.

   george.

On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote:

> I found another problem with the libevent branch.
>
> If I set "-mca btl tcp,self" on the command line then I get a segfault
> when sending messages > 16 KB. I can try to make a smaller reproducer,
> but for now you can use the "progress" or "simple" tests in ompi-tests:
> https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness
>
> To build:
> shell$ make
> To run with failure:
> shell$ mpirun -np 2 -mca btl tcp,self progress -s 16 -v 1
> To run without failure:
> shell$ mpirun -np 2 -mca btl tcp,self progress -s 15 -v 1
>
> This program will display the message "Checkpoint at any time...". If
> you send mpirun SIGUSR2 it will progress to the next stage of the
> test. The failure occurs during the first message exchange, though,
> before this becomes an issue.
>
> I was using Odin, and if I do not specify the btls then the test will
> pass as normal.
>
> The backtrace is below:
> ------------------------------------------
> ...
> Core was generated by `progress -s 16 -v 1'.
> Program terminated with signal 11, Segmentation fault.
> #0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
> 267 bml_btl->btl_free( bml_btl->btl, des );
> (gdb) bt
> #0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
> #1 0x0000002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0, ep=0x0, des=0x559700, status=0) at pml_ob1_recvreq.c:190
> #2 0x0000002a97930069 in mca_pml_ob1_recv_frag_callback (btl=0x5598c0, tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at pml_ob1_recvfrag.c:149
> #3 0x0000002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10, flags=2, user=0x5a5df0) at btl_tcp_endpoint.c:696
> #4 0x0000002a95a0ab93 in event_process_active (base=0x508c80) at event.c:591
> #5 0x0000002a95a0af59 in opal_event_base_loop (base=0x508c80, flags=2) at event.c:763
> #6 0x0000002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
> #7 0x0000002a959fadf8 in opal_progress () at runtime/opal_progress.c:169
> #8 0x0000002a9792caae in opal_condition_wait (c=0x2a9587d940, m=0x2a9587d9c0) at ../../../../opal/threads/condition.h:93
> #9 0x0000002a9792c9dd in ompi_request_wait_completion (req=0x5a5380) at ../../../../ompi/request/request.h:381
> #10 0x0000002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70, count=16384, datatype=0x503770, src=1, tag=1001, comm=0x5039a0, status=0x0) at pml_ob1_irecv.c:104
> #11 0x0000002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384, type=0x503770, source=1, tag=1001, comm=0x5039a0, status=0x0) at precv.c:75
> #12 0x000000000040211f in exchange_stage1 (ckpt_num=1) at progress.c:414
> #13 0x0000000000401295 in main (argc=5, argv=0x7fbfffe668) at progress.c:131
> (gdb) p bml_btl
> $1 = (mca_bml_base_btl_t *) 0x736275705f61636d
> (gdb) p *bml_btl
> Cannot access memory at address 0x736275705f61636d
> ------------------------------------------
>
> -- Josh
>
> On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:
>
>> WHAT: Bring new version of libevent to the trunk.
>>
>> WHY: Newer version, slightly better performance (lower overheads /
>> lighter weight), properly integrate the use of epoll and other
>> scalable fd monitoring mechanisms.
>>
>> WHERE: 98% of the changes are in opal/event; there are a few changes
>> to configury and one change to the orted.
>>
>> TIMEOUT: COB, Friday, 21 March 2008
>>
>> DESCRIPTION:
>>
>> George/UTK has done the bulk of the work to integrate a new version
>> of libevent on the following tmp branch:
>>
>> https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge
>>
>> ** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! **
>>
>> Cisco ran MTT on this branch on Friday and everything checked out
>> (i.e., no more failures than on the trunk). We just made a few more
>> minor changes today and I'm running MTT again now, but I'm not
>> expecting any new failures (MTT will take several hours). We would
>> like to bring the new libevent in over this upcoming weekend, but
>> would very much appreciate if others could test on their platforms
>> (Cisco tests mainly 64 bit RHEL4U4). This new libevent *should* be a
>> fairly side-effect free change, but it is possible that since we're
>> now using epoll and other scalable fd monitoring tools, we'll run
>> into some unanticipated issues on some platforms.
>>
>> Here's a consolidated diff if you want to see the changes:
>>
>> https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842
>>
>> Thanks.
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

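A note on reading the backtrace quoted above: the bml_btl value passed to mca_bml_base_free (0x736275705f61636d) is made up entirely of bytes in the printable ASCII range. That often suggests a pointer field was overwritten with string data, or that the descriptor is being read from the wrong memory. The sketch below is not from the thread. It assumes a little-endian 64-bit host (the thread mentions 64-bit platforms but does not state the byte order), and the helper name pointer_as_ascii is made up for illustration. It shows one way to decode such a value.

```python
import struct


def pointer_as_ascii(value: int) -> str:
    """Render a 64-bit value as the 8 bytes it would occupy in memory.

    Assumption: little-endian layout, so the low-order byte comes first.
    Bytes outside the printable ASCII range are shown as '.'.
    """
    raw = struct.pack("<Q", value)  # "<Q" = little-endian unsigned 64-bit
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # The suspicious bml_btl value from frame #0 of the backtrace.
    print(pointer_as_ascii(0x736275705F61636D))
    # For comparison, the btl argument from frame #1, which looks like an
    # ordinary heap address and decodes to mostly non-printable bytes.
    print(pointer_as_ascii(0x5598C0))
```

Under the little-endian assumption the first value decodes to the text "mca_pubs". This only hints at what kind of data ended up in that slot; it does not by itself identify the cause of the crash.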