Open MPI Development Mailing List Archives

This web mail archive is frozen.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] Infinite Loop: ompi_free_list_wait
From: Timothy Hayes (hayesti_at_[hidden])
Date: 2009-03-23 14:11:32

That's a relief to know, although I'm still a bit concerned. I'm looking at
the code for the OpenMPI 1.3 trunk and in the ob1 component I can see the
following sequence:

mca_pml_ob1_recv_frag_callback_match -> append_frag_to_list ->
MCA_PML_OB1_RECV_FRAG_ALLOC -> OMPI_FREE_LIST_WAIT -> __ompi_free_list_wait

so I'm guessing that, unless the deadlock issue has been resolved for that
function, it will still fail nondeterministically. I'm quite eager to give
it a try, but my component doesn't compile as-is against the 1.3 source. Is it
trivial to convert it?

Or maybe you were suggesting that I go into the code of ob1 myself and
manually change every _wait to _get?

Kind regards
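
For readers of the archive, the change asked about above (replacing a blocking _wait with a non-blocking _get) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy_* names, the fixed pool size, and the pending queue are all hypothetical and are not the Open MPI free list or the ob1 code. The point is the shape of the call site: the allocation may return NULL, and the caller parks the work instead of blocking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a fixed-size free list; NOT ompi_free_list_t. */
#define TOY_POOL_SIZE   2
#define TOY_PENDING_MAX 8

typedef struct toy_frag {
    struct toy_frag *next;
    int payload;
} toy_frag_t;

typedef struct {
    toy_frag_t storage[TOY_POOL_SIZE];
    toy_frag_t *head;                 /* singly linked list of free fragments */
} toy_free_list_t;

/* Hypothetical queue of work that could not get a fragment yet. */
typedef struct {
    int payloads[TOY_PENDING_MAX];
    size_t count;
} toy_pending_t;

static void toy_free_list_init(toy_free_list_t *fl)
{
    fl->head = NULL;
    for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
        fl->storage[i].next = fl->head;
        fl->head = &fl->storage[i];
    }
}

/* Non-blocking "get": returns NULL immediately when the list is empty. */
static toy_frag_t *toy_free_list_get(toy_free_list_t *fl)
{
    toy_frag_t *frag = fl->head;
    if (frag != NULL) {
        fl->head = frag->next;
        frag->next = NULL;
    }
    return frag;
}

/* Hand a fragment back so a later get can succeed. */
static void toy_free_list_return(toy_free_list_t *fl, toy_frag_t *frag)
{
    assert(frag != NULL);
    frag->next = fl->head;
    fl->head = frag;
}

/* Receive-path handler.
 * Returns 1 if a fragment was obtained, 0 if the payload was deferred,
 * and -1 if the pending queue is full. Never blocks. */
static int toy_handle_incoming(toy_free_list_t *fl, toy_pending_t *pending,
                               int payload, toy_frag_t **out)
{
    *out = NULL;
    toy_frag_t *frag = toy_free_list_get(fl);
    if (frag == NULL) {
        if (pending->count == TOY_PENDING_MAX) {
            return -1;
        }
        pending->payloads[pending->count++] = payload;  /* retry later */
        return 0;
    }
    frag->payload = payload;
    *out = frag;
    return 1;
}
```

With a pool of two fragments, a third incoming payload is deferred rather than waited for, and it can be retried once a fragment has been returned. Whether a given call site can tolerate a failed allocation depends on the surrounding code, so this is not a mechanical search-and-replace.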

2009/3/23 George Bosilca <bosilca_at_[hidden]>

> It is a known problem. When the free list is empty, calling
> ompi_free_list_wait will block the process until at least one fragment
> becomes available. As a fragment can become available only when returned by
> the BTL, this can lead to deadlocks in some cases. The workaround is to ban
> the usage of the blocking _wait function and replace it with the
> non-blocking version, _get. The PML has all the required logic to deal with
> the cases where a fragment cannot be allocated. We changed most of the BTLs
> to use _get instead of _wait a few months ago.
> Thanks,
> george.
> On Mar 23, 2009, at 11:58, Timothy Hayes wrote:
>> Hello,
>> I'm working on an OpenMPI BTL component and am having a recurring problem;
>> I was wondering if anyone could shed some light on it. I have a component
>> that's quite straightforward: it uses a pair of lightweight sockets to take
>> advantage of being in a virtualised environment (specifically Xen). My code
>> is a bit messy and has lots of inefficiencies, but the logic seems sound
>> enough. I've been able to execute a few simple programs successfully using
>> the component, and they work most of the time.
>> The problem I'm having is actually happening in higher layers,
>> specifically in my asynchronous receive handler, when I call the callback
>> function (cbfunc) that was set by the PML in the BTL initialisation phase.
>> It seems to be getting stuck in an infinite loop at __ompi_free_list_wait();
>> in this function there is a condition variable which should get set
>> eventually but just doesn't. I've stepped through it with GDB and I get a
>> backtrace of something like this:
>> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
>> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
>> __ompi_free_list_wait -> opal_condition_wait
>> and from there it just loops. Although this is happening in higher levels,
>> I haven't noticed something like this happening in any of the other BTL
>> components, so chances are there's something in my code that's causing this.
>> I very much doubt that it's actually waiting for a list item to be returned,
>> since this infinite loop can occur nondeterministically and sometimes even
>> on the first receive callback.
>> I'm really not too sure what else to include with this e-mail. I could
>> send my source code (a bit nasty right now) if it would be helpful, but I'm
>> hoping that someone might have noticed this problem before or something
>> similar. Maybe I'm making a common mistake. Any advice would be really
>> appreciated!
>> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>> Kind regards
>> Tim Hayes
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]