Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Infinite Loop: ompi_free_list_wait
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-03-23 13:28:17

It is a known problem. When the free list is empty, ompi_free_list_wait
blocks the process until at least one fragment becomes available. As a
fragment can become available only when it is returned by the BTL, this
can lead to deadlocks in some cases. The workaround is to ban the usage
of the blocking _wait function and replace it with the non-blocking
version, _get. The PML has all the required logic to deal with the
cases where a fragment cannot be allocated. We changed most of the BTLs
to use _get instead of _wait a few months ago.
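
As a rough illustration of the non-blocking pattern described above,
here is a toy sketch. It is not Open MPI code: every toy_* name is
hypothetical, the real ompi_free_list API is not reproduced, and the
fixed-size pending array merely stands in for whatever bookkeeping the
caller already has. The point is only that an empty list returns NULL
right away, so the caller can record the work and retry once a fragment
has been handed back, instead of blocking inside the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy free list; NOT the Open MPI ompi_free_list_t. */
#define TOY_POOL_SIZE   2
#define TOY_MAX_PENDING 8

typedef struct toy_frag {
    struct toy_frag *next;
    int payload;
} toy_frag_t;

typedef struct {
    toy_frag_t storage[TOY_POOL_SIZE];
    toy_frag_t *head;               /* singly linked list of free fragments */
} toy_free_list_t;

/* Work that could not get a fragment and must be retried later. */
typedef struct {
    int pending[TOY_MAX_PENDING];
    int npending;
} toy_pending_t;

static void toy_free_list_init(toy_free_list_t *fl)
{
    fl->head = NULL;
    for (int i = 0; i < TOY_POOL_SIZE; i++) {
        fl->storage[i].next = fl->head;
        fl->head = &fl->storage[i];
    }
}

/* Non-blocking: returns NULL immediately when the list is empty. */
static toy_frag_t *toy_free_list_get(toy_free_list_t *fl)
{
    toy_frag_t *f = fl->head;
    if (f != NULL) {
        fl->head = f->next;
    }
    return f;
}

/* Hand a fragment back to the list so a later get can succeed. */
static void toy_free_list_return(toy_free_list_t *fl, toy_frag_t *f)
{
    assert(f != NULL);
    f->next = fl->head;
    fl->head = f;
}

/* Try to handle a message. If no fragment is free, remember the message
 * for a later retry and return NULL instead of waiting. A message is
 * silently dropped if the pending array is already full (toy behaviour). */
static toy_frag_t *toy_handle(toy_free_list_t *fl, toy_pending_t *pq, int msg)
{
    toy_frag_t *f = toy_free_list_get(fl);
    if (f == NULL) {
        if (pq->npending < TOY_MAX_PENDING) {
            pq->pending[pq->npending++] = msg;
        }
        return NULL;
    }
    f->payload = msg;
    return f;
}
```

With a pool of two fragments, a third toy_handle call returns NULL and
leaves its message in the pending array; after toy_free_list_return
gives one fragment back, retrying that message succeeds.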


On Mar 23, 2009, at 11:58 , Timothy Hayes wrote:

> Hello,
> I'm working on an OpenMPI BTL component and am having a recurring
> problem; I was wondering if anyone could shed some light on it. I
> have a component that's quite straightforward: it uses a pair of
> lightweight sockets to take advantage of being in a virtualised
> environment (specifically Xen). My code is a bit messy and has lots
> of inefficiencies, but the logic seems sound enough. I've been able
> to execute a few simple programs successfully using the component,
> and they work most of the time.
> The problem I'm having is actually happening in higher layers,
> specifically in my asynchronous receive handler, when I call the
> callback function (cbfunc) that was set by the PML in the BTL
> initialisation phase. It seems to be getting stuck in an infinite
> loop at __ompi_free_list_wait(). In this function there is a
> condition variable which should get set eventually but just doesn't.
> I've stepped through it with GDB and I get a backtrace of something
> like this:
> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv
> -> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
> __ompi_free_list_wait -> opal_condition_wait
> and from there it just loops. Although this is happening in higher
> levels, I haven't noticed anything like this happening in any of
> the other BTL components, so chances are there's something in my code
> that's causing this. I very much doubt that it's actually waiting
> for a list item to be returned, since this infinite loop can occur
> non-deterministically and sometimes even on the first receive
> callback.
> I'm really not too sure what else to include with this e-mail. I
> could send my source code (a bit nasty right now) if it would be
> helpful, but I'm hoping that someone might have noticed this problem
> before or something similar. Maybe I'm making a common mistake. Any
> advice would be really appreciated!
> I'm using OpenMPI 1.2.9 from the SVN tag repository.
> Kind regards
> Tim Hayes