Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: libevent update
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-03-18 21:12:16


When did you fix it? I merged the trunk down to the libevent-merge
branch late this afternoon (r17869).

On Mar 18, 2008, at 7:29 PM, George Bosilca wrote:

> This has been fixed in the trunk, but not yet merged in the branch.
>
> george.
>
> On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote:
>
>> I found another problem with the libevent branch.
>>
>> If I set "-mca btl tcp,self" on the command line then I get a segfault
>> when sending messages > 16 KB. I can try to make a smaller reproducer,
>> but for now you can use the "progress" or "simple" tests in ompi-tests below:
>> https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness
>>
>> To build:
>> shell$ make
>> To run with failure:
>> shell$ mpirun -np 2 -mca btl tcp,self progress -s 16 -v 1
>> To run without failure:
>> shell$ mpirun -np 2 -mca btl tcp,self progress -s 15 -v 1
>>
>> This program will display the message "Checkpoint at any time...". If
>> you send mpirun SIGUSR2 it will progress to the next stage of the
>> test. The failure occurs during the first message exchange, though,
>> before this becomes an issue.
>>
>> I was using Odin, and if I do not specify the btls then the test will
>> pass as normal.
>>
>> The backtrace is below:
>> ------------------------------------------
>> ...
>> Core was generated by `progress -s 16 -v 1'.
>> Program terminated with signal 11, Segmentation fault.
>> #0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
>> 267 bml_btl->btl_free( bml_btl->btl, des );
>> (gdb) bt
>> #0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
>> #1 0x0000002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0, ep=0x0, des=0x559700, status=0) at pml_ob1_recvreq.c:190
>> #2 0x0000002a97930069 in mca_pml_ob1_recv_frag_callback (btl=0x5598c0, tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at pml_ob1_recvfrag.c:149
>> #3 0x0000002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10, flags=2, user=0x5a5df0) at btl_tcp_endpoint.c:696
>> #4 0x0000002a95a0ab93 in event_process_active (base=0x508c80) at event.c:591
>> #5 0x0000002a95a0af59 in opal_event_base_loop (base=0x508c80, flags=2) at event.c:763
>> #6 0x0000002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
>> #7 0x0000002a959fadf8 in opal_progress () at runtime/opal_progress.c:169
>> #8 0x0000002a9792caae in opal_condition_wait (c=0x2a9587d940, m=0x2a9587d9c0) at ../../../../opal/threads/condition.h:93
>> #9 0x0000002a9792c9dd in ompi_request_wait_completion (req=0x5a5380) at ../../../../ompi/request/request.h:381
>> #10 0x0000002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70, count=16384, datatype=0x503770, src=1, tag=1001, comm=0x5039a0, status=0x0) at pml_ob1_irecv.c:104
>> #11 0x0000002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384, type=0x503770, source=1, tag=1001, comm=0x5039a0, status=0x0) at precv.c:75
>> #12 0x000000000040211f in exchange_stage1 (ckpt_num=1) at progress.c:414
>> #13 0x0000000000401295 in main (argc=5, argv=0x7fbfffe668) at progress.c:131
>> (gdb) p bml_btl
>> $1 = (mca_bml_base_btl_t *) 0x736275705f61636d
>> (gdb) p *bml_btl
>> Cannot access memory at address 0x736275705f61636d
>> ------------------------------------------
>>
>> -- Josh
>>
>> On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:
>>
>>> WHAT: Bring new version of libevent to the trunk.
>>>
>>> WHY: Newer version, slightly better performance (lower overheads /
>>> lighter weight), properly integrate the use of epoll and other
>>> scalable fd monitoring mechanisms.
>>>
>>> WHERE: 98% of the changes are in opal/event; there are a few changes
>>> to configury and one change to the orted.
>>>
>>> TIMEOUT: COB, Friday, 21 March 2008
>>>
>>> DESCRIPTION:
>>>
>>> George/UTK has done the bulk of the work to integrate a new version
>>> of libevent on the following tmp branch:
>>>
>>> https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge
>>>
>>> ** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS
>>> BRANCH! **
>>>
>>> Cisco ran MTT on this branch on Friday and everything checked out
>>> (i.e., no more failures than on the trunk). We just made a few more
>>> minor changes today and I'm running MTT again now, but I'm not
>>> expecting any new failures (MTT will take several hours). We would
>>> like to bring the new libevent in over this upcoming weekend, but
>>> would very much appreciate if others could test on their platforms
>>> (Cisco tests mainly 64 bit RHEL4U4). This new libevent *should* be a
>>> fairly side-effect free change, but it is possible that since we're
>>> now using epoll and other scalable fd monitoring tools, we'll run
>>> into some unanticipated issues on some platforms.
>>>
>>> Here's a consolidated diff if you want to see the changes:
>>>
>>> https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842
>>>
>>> Thanks.
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems
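
Note on the quoted backtrace: the bml_btl value that gdb could not
dereference, 0x736275705f61636d, is made up entirely of printable ASCII
bytes. The sketch below decodes it, assuming a little-endian 64-bit
host (an assumption; the thread only says the run was on Odin). The
helper name pointer_as_ascii is hypothetical and is not part of Open
MPI or gdb.

```python
def pointer_as_ascii(value: int, size: int = 8) -> str:
    """Render a pointer value as the bytes it would occupy in memory.

    Assumes a little-endian host, so the least significant byte comes
    first. Non-printable bytes are shown as '.'.
    """
    raw = value.to_bytes(size, byteorder="little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # bml_btl value copied from frame #0 of the quoted backtrace.
    print(pointer_as_ascii(0x736275705F61636D))  # prints "mca_pubs"
```

A pointer that reads as text often means the memory holding it was
overwritten with string data or was freed and reused, but that is only
a hint about where to look, not a diagnosis of this particular crash.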