Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Changing BTLs at runtime
From: Christoph Konersmann (c_k_at_[hidden])
Date: 2010-03-26 09:46:22


Hi,

Thanks for your reply and your suggestions, I'll try to give more
detailed information.

On 25.03.2010 20:32, Jeff Squyres wrote:
> On Mar 23, 2010, at 4:02 AM, Christoph Konersmann wrote:
>
>> It was long ago where I've asked about hints to implement a dynamic BTL
>> control. I've currently managed to change the MPI communication path
>> from a BTL module (e.g. openib) to another BTL module (e.g. tcp) at
>> runtime of a distributed application.
>>
>> For this I've developed a so called BTL Control Client (orte-btlctl) to
>> send control messages to all processes through the ORTE RML.
>
> Cool!
>
> FWIW, you might want to name it ompi-btlctl. ORTE is our run-time layer and has no knowledge of the BTL's.

The first problem I had to solve was how to send those commands
directly to the BML in all procs. The solution I've implemented is
nearly the same as the one Ralph Castain mentioned. orte-btlctl sends
its command to the ORTE daemon, which then forwards it through
orte_grpcomm.xcast() to all procs. This is done in orte/orted/orted_comm.c.
A recv_callback function running in the BML receives the specially
tagged command and executes it. The callback also gets the rml_uri of
the orte-btlctl tool, so all answers are sent directly back to the
control client. Because the tool depends on the ORTE daemon, I just
called it orte-btlctl... But it could be renamed to anything else. :)
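
The command path described above can be illustrated with a small,
self-contained C model. It is only a sketch: ctl_msg_t, proc_t, tool_t,
TAG_BTLCTL and all function names are hypothetical stand-ins, not ORTE or
OMPI APIs, and the daemon's xcast is reduced to a plain loop over the procs.

```c
/* Toy model of: tool -> daemon -> xcast -> BML callback -> direct reply.
 * Every name in this file is hypothetical; nothing here is an Open MPI API. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TAG_BTLCTL 42            /* made-up tag marking BTL control commands */

typedef struct {
    int  tag;
    char command[32];            /* e.g. "switch tcp" */
    char reply_uri[64];          /* contact URI of the control tool */
} ctl_msg_t;

typedef struct {
    int  rank;
    char active_btl[16];         /* BTL currently used for MPI traffic */
} proc_t;

typedef struct {
    char uri[64];
    int  replies;                /* answers received so far */
} tool_t;

/* The answer goes straight to the tool, not back through the daemon. */
static void send_reply(tool_t *tool, const char *uri)
{
    if (strcmp(tool->uri, uri) == 0)
        tool->replies++;
}

/* Stand-in for the receive callback in the BML: ignore foreign tags,
 * execute the command, then answer to the URI carried in the message. */
static void bml_recv_callback(proc_t *p, const ctl_msg_t *m, tool_t *tool)
{
    if (m->tag != TAG_BTLCTL)
        return;
    if (strncmp(m->command, "switch ", 7) == 0)
        snprintf(p->active_btl, sizeof p->active_btl, "%s", m->command + 7);
    send_reply(tool, m->reply_uri);
}

/* Stand-in for the daemon forwarding one message to all procs. */
static void daemon_xcast(proc_t procs[], int n, const ctl_msg_t *m, tool_t *tool)
{
    for (int i = 0; i < n; i++)
        bml_recv_callback(&procs[i], m, tool);
}

/* The tool builds a tagged command that carries its own URI.
 * Returns the number of replies that came back. */
static int tool_send_command(tool_t *tool, proc_t procs[], int n, const char *cmd)
{
    ctl_msg_t m;

    assert(n >= 0);
    m.tag = TAG_BTLCTL;
    snprintf(m.command, sizeof m.command, "%s", cmd);
    snprintf(m.reply_uri, sizeof m.reply_uri, "%s", tool->uri);
    tool->replies = 0;
    daemon_xcast(procs, n, &m, tool);
    return tool->replies;
}
```

With three procs that start on "openib", calling
tool_send_command(&tool, procs, 3, "switch tcp") returns 3 and leaves
active_btl set to "tcp" in every proc.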

>
>> These
>> messages are received and processed in the OMPI BML. In BML I've
>> implemented a function to stop the MPI communication and another for
>> changing the BTL exclusivity and recalculating the btl_{send,eager,rdma}
>> lists. All is done at runtime so a distributed application running with
>> Open MPI is not affected in its computation.
>>
>> I also managed to unload a module not used anymore, e.g. openib after
>> changing the MPI communication to tcp, through the already implemented
>> function mca_bml_r2_del_btl(mca_btl_base_module_t* btl).
>
> Sounds great!
>
>> The Question:
>> The function to (re)initialise a BTL module
>> "mca_bml_r2_add_btl(mca_btl_base_module_t* btl)" is currently not
>> implemented. Why is it not implemented? And what has to be done if I
>> want to implement it?
>
> I'm actually not sure -- this is not an area of the code where I am an expert...
>
> It looks like the r2 proc_add is calling the internal function add_btls (plural). I don't know where in the code base calls bml.add_btl...? (does anywhere call it?) It may have been planned but then never used...?

No, I haven't found any code snippet which calls this function... Maybe
there was just no need for it...

>
>> As far as I understood the internals of the OMPI Layer, for adding a BTL
>> module you have to implement the following steps:
>> 1. find the corresponding component in mca_btl_base_components_opened
>> 2. Do component->btl_init to get an array of BTL modules
>> 3. and add those to mca_btl_base_modules_initialized
>> 4. Iterate through mca_btl_base_modules_initialized and add BTL module
>> to mca_bml_r2.btl_modules in bml_r2
>> 5. Add BTL module to btl_{send,eager,rdma} (if applicable) for all
>> reachable procs
>
> This *sounds* right, but again, I'm not the expert in this part of the code base.

I currently have an experimental function called
mca_bml_r2_add_btl_by_name(char* btl_name); which is under construction
and not really working yet.

Its current tasks are:
1. Search for the given component name and initialize the modules for
each available interface
2. Add those BTL modules to the list of initialized modules in
mca_btl_base_modules_initialized
3. For each initialized BTL module whose component name is the given
btl_name do:
        1. Add BTL to the list mca_bml_r2.btl_modules
        2. For each process do:
                1. Add this btl to the list of btl_send in bml_endpoint if reachable
                2. Set btl_mpool = NULL
                3. Reset other attributes and set btl_endpoint, at least try to set it...
                4. Register btl_progress
4. Recalculate the lists btl_send, btl_eager and btl_rdma to make sure
the highest exclusivity is used.
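
Steps 3.1, 3.2.1 and 4 of this list can be sketched in isolation as
follows. This is a toy model under stated assumptions, not the r2 code:
toy_btl_t, toy_endpoint_t, reach_mask and both functions are hypothetical,
reachability is reduced to one bit per peer, and "highest exclusivity is
used" is read as "only the modules with the maximum exclusivity stay in
the send list". Memory pools, btl_endpoint setup and btl_progress
registration (steps 3.2.2 to 3.2.4) are left out on purpose.

```c
/* Toy model of adding BTL modules by component name and recalculating
 * the per-peer send list. All types and names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BTLS 8

typedef struct {
    const char *component;       /* e.g. "openib", "tcp" */
    int         exclusivity;     /* higher value wins */
    unsigned    reach_mask;      /* bit p set: peer p is reachable */
} toy_btl_t;

typedef struct {
    toy_btl_t *all[MAX_BTLS];    /* every module usable for this peer */
    size_t     n_all;
    toy_btl_t *send[MAX_BTLS];   /* modules actually chosen for sending */
    size_t     n_send;
} toy_endpoint_t;

/* Step 4: keep only the modules with the highest exclusivity. */
static void recalc_send_list(toy_endpoint_t *ep)
{
    int best = -1;

    for (size_t i = 0; i < ep->n_all; i++)
        if (ep->all[i]->exclusivity > best)
            best = ep->all[i]->exclusivity;

    ep->n_send = 0;
    for (size_t i = 0; i < ep->n_all; i++)
        if (ep->all[i]->exclusivity == best)
            ep->send[ep->n_send++] = ep->all[i];
}

/* Step 3: attach every module of component `name` to each peer it can
 * reach, then recalculate all send lists.
 * Returns the number of (module, peer) pairs that were added. */
static int add_btl_by_name(toy_btl_t *mods, size_t n_mods, const char *name,
                           toy_endpoint_t *eps, size_t n_eps)
{
    int added = 0;

    assert(n_eps <= 8 * sizeof(unsigned));   /* one mask bit per peer */

    for (size_t m = 0; m < n_mods; m++) {
        if (strcmp(mods[m].component, name) != 0)
            continue;                        /* other component */
        for (size_t p = 0; p < n_eps; p++) {
            if (!(mods[m].reach_mask & (1u << p)))
                continue;                    /* peer not reachable */
            if (eps[p].n_all >= MAX_BTLS)
                continue;                    /* no room left in this toy list */
            eps[p].all[eps[p].n_all++] = &mods[m];
            added++;
        }
    }
    for (size_t p = 0; p < n_eps; p++)
        recalc_send_list(&eps[p]);
    return added;
}
```

In this model, a peer that is reachable through both a low-exclusivity
"tcp" module and a higher-exclusivity "openib" module ends up with only
the "openib" module in its send list, while a peer reachable only through
"tcp" keeps using "tcp".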

I think the challenging part is setting the endpoints in the BML and BTL
in all procs. Maybe I'm missing something related to the memory pool or
other components. I don't know exactly what the problem is, but the
re-added BTL module causes a segfault on the first packet it receives
after the MPI communication is switched back to it.

>
>> The Background:
>> I should give some background, why I'm implementing this. Changing the
>> MPI communication from a high speed network to a network with
>> flowcontrol (openib->tcp) is necessary for checkpointing distributed
>> applications in virtual machines. Ok, you are able to checkpoint through
>> the FT-Framework and BLCR in Open MPI, but virtual machines already
>> provide trivial functions for checkpointing. As you are not able to
>> checkpoint the hardware information of e.g. openib you have to get rid
>> of it in case of a checkpoint, and change back again on resume/continue.
>
> I'm not quite sure I understand. I can see how the original model of CRS and SNAPC don't quite fit that of VM's, but I don't quite understand what switching openib -> tcp and then later tcp -> openib gives you...?
>
> Can't you just quiesce the openib BTL, let the VM checkpoint, and then resume with openib? (or whatever other non TCP/sm BTL you want)
>

I worked under the assumption that a virtualization layer might support
InfiniBand through SR-IOV, so that every virtual machine can use it at
full speed. Just quiescing the communication between InfiniBand devices
would not help in the case of migration, when the underlying hardware
and its configuration may change. Therefore I have to unload the
affected BTL module. To make sure that nothing in the BML uses
InfiniBand for transfers anymore, I switch the communication to another
device whose protocol is known to work with migrating virtual machines,
like tcp.

Checkpointing would work with just quiescing the communication, as long
as the InfiniBand hardware does not change.
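
The ordering this implies (switch away first, unload second, and the
reverse on resume) can be written down as a tiny state machine. The
sketch below is only an illustration of that ordering; the state and
event names are hypothetical and do not correspond to anything in the
Open MPI code base.

```c
/* Toy state machine for the openib -> tcp -> openib sequence around a
 * VM migration. All names are hypothetical. */
#include <assert.h>

typedef enum {
    ON_OPENIB,             /* MPI traffic uses openib */
    ON_TCP_IB_LOADED,      /* traffic on tcp, openib module still loaded */
    ON_TCP_IB_UNLOADED     /* traffic on tcp, openib module removed */
} path_state_t;

typedef enum {
    EV_SWITCH_TO_TCP,
    EV_UNLOAD_IB,
    EV_LOAD_IB,
    EV_SWITCH_TO_IB
} path_event_t;

/* Returns the next state; an event that is not allowed in the current
 * state leaves the state unchanged. */
static path_state_t path_step(path_state_t s, path_event_t e)
{
    switch (s) {
    case ON_OPENIB:
        if (e == EV_SWITCH_TO_TCP) return ON_TCP_IB_LOADED;
        break;                     /* unloading a BTL in use is refused */
    case ON_TCP_IB_LOADED:
        if (e == EV_UNLOAD_IB)     return ON_TCP_IB_UNLOADED;
        if (e == EV_SWITCH_TO_IB)  return ON_OPENIB;
        break;
    case ON_TCP_IB_UNLOADED:
        if (e == EV_LOAD_IB)       return ON_TCP_IB_LOADED;
        break;                     /* cannot switch to an unloaded BTL */
    }
    return s;
}

/* Migration is only considered safe once no InfiniBand module is loaded. */
static int safe_to_migrate(path_state_t s)
{
    return s == ON_TCP_IB_UNLOADED;
}
```

Under this model a plain checkpoint on unchanged hardware would not need
to leave ON_OPENIB at all, whereas a migration has to pass through
ON_TCP_IB_UNLOADED before the VM is moved.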

Kind regards,
Christoph Konersmann

-- 
Paderborn Center for Parallel Computing - PC2
University of Paderborn - Germany
http://www.pc2.de
Christoph Konersmann <c_k_at_[hidden]>