Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: libevent update
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-03-19 08:06:34


I re-merged down to the libevent-merge branch (to include r17872) and
a new tarball has been uploaded to http://www.open-mpi.org/~jsquyres/unofficial/

On Mar 18, 2008, at 10:11 PM, George Bosilca wrote:

> Commit 17872 is the one you're looking for.
>
> https://svn.open-mpi.org/trac/ompi/changeset/17872
>
> george.
>
> On Mar 18, 2008, at 9:12 PM, Jeff Squyres wrote:
>
>> When did you fix it? I merged the trunk down to the libevent-merge
>> branch late this afternoon (r17869).
>>
>>
>> On Mar 18, 2008, at 7:29 PM, George Bosilca wrote:
>>
>>> This has been fixed in the trunk, but not yet merged in the branch.
>>>
>>> george.
>>>
>>> On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote:
>>>
>>>> I found another problem with the libevent branch.
>>>>
>>>> If I set "-mca btl tcp,self" on the command line then I get a
>>>> segfault when sending messages > 16 KB. I can try to make a smaller
>>>> reproducer, but you can trigger it with the "progress" or "simple"
>>>> tests in ompi-tests:
>>>> https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness
>>>>
>>>> To build:
>>>> shell$ make
>>>> To run with failure:
>>>> shell$ mpirun -np 2 -mca btl tcp,self progress -s 16 -v 1
>>>> To run without failure:
>>>> shell$ mpirun -np 2 -mca btl tcp,self progress -s 15 -v 1
>>>>
>>>> This program will display the message "Checkpoint at any time...".
>>>> If you send mpirun SIGUSR2 it will progress to the next stage of the
>>>> test. The failure occurs on the first message, though, before this
>>>> point is reached.
>>>>
>>>> I was using Odin, and if I do not specify the btls then the test
>>>> will pass as normal.
>>>>
>>>> The backtrace is below:
>>>> ------------------------------------------
>>>> ...
>>>> Core was generated by `progress -s 16 -v 1'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
>>>> 267 bml_btl->btl_free( bml_btl->btl, des );
>>>> (gdb) bt
>>>> #0 0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
>>>> #1 0x0000002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0, ep=0x0, des=0x559700, status=0) at pml_ob1_recvreq.c:190
>>>> #2 0x0000002a97930069 in mca_pml_ob1_recv_frag_callback (btl=0x5598c0, tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at pml_ob1_recvfrag.c:149
>>>> #3 0x0000002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10, flags=2, user=0x5a5df0) at btl_tcp_endpoint.c:696
>>>> #4 0x0000002a95a0ab93 in event_process_active (base=0x508c80) at event.c:591
>>>> #5 0x0000002a95a0af59 in opal_event_base_loop (base=0x508c80, flags=2) at event.c:763
>>>> #6 0x0000002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
>>>> #7 0x0000002a959fadf8 in opal_progress () at runtime/opal_progress.c:169
>>>> #8 0x0000002a9792caae in opal_condition_wait (c=0x2a9587d940, m=0x2a9587d9c0) at ../../../../opal/threads/condition.h:93
>>>> #9 0x0000002a9792c9dd in ompi_request_wait_completion (req=0x5a5380) at ../../../../ompi/request/request.h:381
>>>> #10 0x0000002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70, count=16384, datatype=0x503770, src=1, tag=1001, comm=0x5039a0, status=0x0) at pml_ob1_irecv.c:104
>>>> #11 0x0000002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384, type=0x503770, source=1, tag=1001, comm=0x5039a0, status=0x0) at precv.c:75
>>>> #12 0x000000000040211f in exchange_stage1 (ckpt_num=1) at progress.c:414
>>>> #13 0x0000000000401295 in main (argc=5, argv=0x7fbfffe668) at progress.c:131
>>>> (gdb) p bml_btl
>>>> $1 = (mca_bml_base_btl_t *) 0x736275705f61636d
>>>> (gdb) p *bml_btl
>>>> Cannot access memory at address 0x736275705f61636d
>>>> ------------------------------------------
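A side note on the backtrace quoted above: the value gdb prints for bml_btl
(0x736275705f61636d) is made up entirely of printable ASCII bytes, which
often hints that a pointer was overwritten by string data. The snippet below
is only a rough sketch for checking that by hand. It assumes a little-endian
64-bit machine (not stated in the thread, but consistent with the 64-bit
addresses in the trace), and the helper name decode_pointer_as_text is
made up for illustration; it is not part of Open MPI or gdb.

```python
def decode_pointer_as_text(value: int, width: int = 8,
                           byteorder: str = "little") -> str:
    """Render the bytes of a pointer-sized integer as ASCII text.

    Assumption: the crashing process stored the value in `byteorder`
    order (little-endian by default). Non-printable bytes become '.'.
    """
    raw = value.to_bytes(width, byteorder)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Pointer value copied from "(gdb) p bml_btl" in the quoted backtrace.
    print(decode_pointer_as_text(0x736275705F61636D))  # prints: mca_pubs
```

Under that byte-order assumption the bytes read "mca_pubs", which looks like
the start of a string rather than a heap address. That is consistent with
memory corruption, though it does not by itself say where the overwrite
came from.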
>>>>
>>>> -- Josh
>>>>
>>>> On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:
>>>>
>>>>> WHAT: Bring new version of libevent to the trunk.
>>>>>
>>>>> WHY: Newer version, slightly better performance (lower overheads /
>>>>> lighter weight), properly integrate the use of epoll and other
>>>>> scalable fd monitoring mechanisms.
>>>>>
>>>>> WHERE: 98% of the changes are in opal/event; there are a few changes
>>>>> to configury and one change to the orted.
>>>>>
>>>>> TIMEOUT: COB, Friday, 21 March 2008
>>>>>
>>>>> DESCRIPTION:
>>>>>
>>>>> George/UTK has done the bulk of the work to integrate a new version
>>>>> of libevent on the following tmp branch:
>>>>>
>>>>> https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge
>>>>>
>>>>> ** WE WOULD VERY MUCH APPRECIATE IT IF PEOPLE COULD MTT TEST THIS
>>>>> BRANCH! **
>>>>>
>>>>> Cisco ran MTT on this branch on Friday and everything checked out
>>>>> (i.e., no more failures than on the trunk). We just made a few more
>>>>> minor changes today and I'm running MTT again now, but I'm not
>>>>> expecting any new failures (MTT will take several hours). We would
>>>>> like to bring the new libevent in over this upcoming weekend, but
>>>>> would very much appreciate it if others could test on their platforms
>>>>> (Cisco tests mainly 64-bit RHEL4U4). This new libevent *should* be a
>>>>> fairly side-effect-free change, but it is possible that since we're
>>>>> now using epoll and other scalable fd monitoring tools, we'll run
>>>>> into some unanticipated issues on some platforms.
>>>>>
>>>>> Here's a consolidated diff if you want to see the changes:
>>>>>
>>>>> https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842
>>>>>
>>>>> Thanks.
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>

-- 
Jeff Squyres
Cisco Systems