I'm working on an OpenMPI BTL component and am having a recurring problem; I
was wondering if anyone could shed some light on it. The component is quite
straightforward: it uses a pair of lightweight sockets to take advantage of
being in a virtualised environment (specifically Xen). My code is a bit messy
and has lots of inefficiencies, but the logic seems sound enough. I've been
able to execute a few simple programs successfully using the component, and
they work most of the time.

The problem shows up in the higher layers, specifically when my asynchronous
receive handler calls the callback function (cbfunc) that was set by the PML
in the BTL initialisation phase. It seems to get stuck in an infinite loop in
__ompi_free_list_wait(). That function waits on a condition variable which
should get signalled eventually but just doesn't. I've stepped through it
with GDB and I get a call chain (outermost call first) of something like
this:
    mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
    mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
    __ompi_free_list_wait -> opal_condition_wait
and from there it just loops. Although this is happening in the higher
levels, I haven't noticed anything like it with any of the other BTL
components, so chances are there's something in my code that's causing it.
I very much doubt that it's actually waiting for a list item to be returned,
since this infinite loop occurs non-deterministically and sometimes even on
the first receive callback.
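
To make the symptom concrete, here is a minimal sketch of how a blocking free-list wait generally behaves. It assumes a plain pthreads design; the names item_t, free_list_t, free_list_try_get, free_list_wait and free_list_return are hypothetical and this is not OpenMPI's actual ompi_free_list_t code. The point is only that the waiter can leave its loop solely when some other code path returns an item and signals the condition.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical types for illustration; not OpenMPI's ompi_free_list_t. */
typedef struct item {
    struct item *next;
} item_t;

typedef struct {
    item_t *head;            /* stack of currently available items */
    size_t num_waiting;      /* callers blocked in free_list_wait() */
    pthread_mutex_t lock;
    pthread_cond_t cond;
} free_list_t;

/* Build the list from a caller-supplied array of n items. */
static void free_list_init(free_list_t *fl, item_t *items, size_t n)
{
    fl->head = NULL;
    fl->num_waiting = 0;
    pthread_mutex_init(&fl->lock, NULL);
    pthread_cond_init(&fl->cond, NULL);
    for (size_t i = 0; i < n; i++) {
        items[i].next = fl->head;
        fl->head = &items[i];
    }
}

/* Non-blocking get: returns NULL when the list is empty. */
static item_t *free_list_try_get(free_list_t *fl)
{
    pthread_mutex_lock(&fl->lock);
    item_t *it = fl->head;
    if (it != NULL) {
        fl->head = it->next;
    }
    pthread_mutex_unlock(&fl->lock);
    return it;
}

/* Blocking get: loops until an item is available. If nothing ever
 * calls free_list_return(), this never comes back. */
static item_t *free_list_wait(free_list_t *fl)
{
    pthread_mutex_lock(&fl->lock);
    while (fl->head == NULL) {
        fl->num_waiting++;
        pthread_cond_wait(&fl->cond, &fl->lock);
        fl->num_waiting--;
    }
    item_t *it = fl->head;
    fl->head = it->next;
    pthread_mutex_unlock(&fl->lock);
    return it;
}

/* Return an item and wake one waiter, if any. */
static void free_list_return(free_list_t *fl, item_t *it)
{
    assert(it != NULL);
    pthread_mutex_lock(&fl->lock);
    it->next = fl->head;
    fl->head = it;
    if (fl->num_waiting > 0) {
        pthread_cond_signal(&fl->cond);
    }
    pthread_mutex_unlock(&fl->lock);
}
```

In a design like this, a wait that never finishes means either that the list really is empty and nothing returns an item, or that the list structure itself looks empty to the waiter (for example because it was corrupted or never filled).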

I'm really not too sure what else to include with this e-mail. I could send
my source code (a bit nasty right now) if it would be helpful, but I'm
hoping that someone might have noticed this problem, or something similar,
before. Maybe I'm making a common mistake. Any advice would be really
appreciated.

I'm using OpenMPI 1.2.9 from the SVN tag repository.