What is the error that you are getting when the compilation fails?
On 3/23/09, Timothy Hayes <hayesti_at_[hidden]> wrote:
> That's a relief to know, although I'm still a bit concerned. I'm looking at
> the code for the OpenMPI 1.3 trunk and in the ob1 component I can see the
> following sequence:
> mca_pml_ob1_recv_frag_callback_match -> append_frag_to_list ->
> MCA_PML_OB1_RECV_FRAG_ALLOC -> OMPI_FREE_LIST_WAIT -> __ompi_free_list_wait
> so I'm guessing that unless the deadlock issue has been resolved for that
> function, it will still fail non-deterministically. I'm quite eager to give
> it a try, but my component doesn't compile as-is with the 1.3 source. Is it
> trivial to convert it?
> Or maybe you were suggesting that I go into the code of ob1 myself and
> manually change every _wait to _get?
> Kind regards
> 2009/3/23 George Bosilca <bosilca_at_[hidden]>
>> It is a known problem. When the freelist is empty, going into
>> ompi_free_list_wait will block the process until at least one fragment
>> becomes available. As a fragment can become available only when returned by
>> the BTL, this can lead to deadlocks in some cases. The workaround is to ban
>> the usage of the blocking _wait function and replace it with the
>> non-blocking version, _get. The PML has all the required logic to deal with
>> the cases where a fragment cannot be allocated. We changed most of the BTLs
>> to use _get instead of _wait a few months ago.
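In other words, the workaround George describes amounts to something like the
sketch below. This assumes the 1.3-era OMPI_FREE_LIST_GET(list, item, rc)
macro form; the free list and fragment names are approximations of what ob1
uses and have not been checked against the tree:

    ompi_free_list_item_t *item;
    mca_pml_ob1_recv_frag_t *frag;
    int rc;

    /* non-blocking: returns immediately even when the list is empty */
    OMPI_FREE_LIST_GET(&mca_pml_ob1.recv_frags, item, rc);
    if (OMPI_SUCCESS != rc || NULL == item) {
        /* no fragment available right now: instead of blocking inside
           OMPI_FREE_LIST_WAIT (and possibly deadlocking), give up and
           let the PML retry or defer the match */
        return;
    }
    frag = (mca_pml_ob1_recv_frag_t *) item;
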
>> On Mar 23, 2009, at 11:58 , Timothy Hayes wrote:
>>> I'm working on an OpenMPI BTL component and am having a recurring
>>> problem; I was wondering if anyone could shed some light on it. I have a
>>> component that's quite straightforward: it uses a pair of lightweight
>>> sockets to take advantage of being in a virtualised environment
>>> (specifically Xen). My code is a bit messy and has lots of inefficiencies,
>>> but the logic seems sound enough. I've been able to execute a few simple
>>> programs successfully using the component, and they work most of the time.
>>> The problem I'm having is actually happening in the higher layers,
>>> specifically in my asynchronous receive handler when I call the callback
>>> function (cbfunc) that was set by the PML in the BTL initialisation phase.
>>> It seems to be getting stuck in an infinite loop at __ompi_free_list_wait();
>>> in this function there is a condition variable which should eventually get
>>> signalled but just doesn't. I've stepped through it with GDB and I get a
>>> backtrace of something like this:
>>> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
>>> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
>>> __ompi_free_list_wait -> opal_condition_wait
>>> and from there it just loops. Although this is happening in the higher
>>> levels, I haven't noticed anything like this happening in any of the other
>>> BTL components, so chances are there's something in my code that's causing
>>> it. I very much doubt that it's actually waiting for a list item to be
>>> returned, since this infinite loop can occur non-deterministically and
>>> sometimes even on the first receive callback.
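For context, the hand-off Tim describes (the BTL receive handler invoking the
cbfunc the PML registered) usually has roughly this shape. The registration
lookup below is purely illustrative; reg, recv_reg and hdr are stand-in names
for however the BTL stores what was passed to btl_register(), not actual xen
BTL code:

    /* inside the BTL's asynchronous receive handler, after a message has
       been pulled off the socket and wrapped in a descriptor (frag) */
    mca_btl_base_tag_t tag = hdr->tag;                  /* tag from the wire  */
    mca_btl_xen_recv_reg_t *reg = &btl->recv_reg[tag];  /* hypothetical table */

    /* invoke the callback the PML registered during initialisation; for the
       ob1 PML this is where mca_pml_ob1_recv_frag_callback() is entered, and
       where the __ompi_free_list_wait hang in the backtrace above shows up */
    reg->cbfunc(&btl->super, tag, &frag->base, reg->cbdata);
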
>>> I'm really not too sure what else to include with this e-mail. I could
>>> send my source code (a bit nasty right now) if it would be helpful, but I'm
>>> hoping that someone might have noticed this problem (or something similar)
>>> before. Maybe I'm making a common mistake. Any advice would be really
>>> appreciated.
>>> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>>> Kind regards
>>> Tim Hayes