
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Changing BTLs at runtime
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-03-25 15:32:28


On Mar 23, 2010, at 4:02 AM, Christoph Konersmann wrote:

> A while ago I asked for hints on implementing dynamic BTL
> control. I have now managed to switch the MPI communication path
> from one BTL module (e.g. openib) to another BTL module (e.g. tcp) at
> runtime of a distributed application.
>
> For this I've developed a so-called BTL Control Client (orte-btlctl) to
> send control messages to all processes through the ORTE RML.

Cool!

FWIW, you might want to name it ompi-btlctl. ORTE is our run-time layer and has no knowledge of the BTLs.

> These
> messages are received and processed in the OMPI BML. In BML I've
> implemented a function to stop the MPI communication and another for
> changing the BTL exclusivity and recalculating the btl_{send,eager,rdma}
> lists. All is done at runtime so a distributed application running with
> Open MPI is not affected in its computation.
>
> I also managed to unload a module not used anymore, e.g. openib after
> changing the MPI communication to tcp, through the already implemented
> function mca_bml_r2_del_btl(mca_btl_base_module_t* btl).

Sounds great!

> The Question:
> The function to (re)initialise a BTL module
> "mca_bml_r2_add_btl(mca_btl_base_module_t* btl)" is currently not
> implemented. Why is it not implemented? And what has to be done if I
> want to implement it?

I'm actually not sure -- this is not an area of the code where I am an expert...

It looks like the r2 proc_add is calling the internal function add_btls (plural). I don't know where in the code base bml.add_btl gets called...? (Does anything call it?) It may have been planned but then never used...?

> As far as I understood the internals of the OMPI Layer, for adding a BTL
> module you have to implement the following steps:
> 1. find the corresponding component in mca_btl_base_components_opened
> 2. Do component->btl_init to get an array of BTL modules
> 3. and add those to mca_btl_base_modules_initialized
> 4. Iterate through mca_btl_base_modules_initialized and add BTL module
> to mca_bml_r2.btl_modules in bml_r2
> 5. Add BTL module to btl_{send,eager,rdma} (if applicable) for all
> reachable procs

This *sounds* right, but again, I'm not the expert in this part of the code base.
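
To make the five quoted steps concrete, here is a minimal, self-contained sketch of the control flow. It is not Open MPI code: every `toy_*` type, field, and function is a hypothetical stand-in invented for illustration, it only models a single per-proc send list (no eager/rdma lists), and it assumes every proc is reachable. The real data structures and error handling in the BTL base and bml_r2 will differ.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_MODULES 8
#define TOY_MAX_PROCS   4

/* Hypothetical stand-ins; these are NOT the real Open MPI types. */
typedef struct toy_module {
    const char *component_name;
    int exclusivity;              /* arbitrary illustrative value */
} toy_module_t;

typedef struct toy_component {
    const char *name;
    /* Writes up to max modules into out, returns how many were created. */
    size_t (*init)(toy_module_t *out, size_t max);
} toy_component_t;

typedef struct toy_list {
    const toy_module_t *items[TOY_MAX_MODULES];
    size_t count;
} toy_list_t;

typedef struct toy_state {
    const toy_component_t *components_opened;  /* models the opened-components list */
    size_t n_components;
    toy_module_t storage[TOY_MAX_MODULES];     /* backing store for created modules */
    size_t n_storage;
    toy_list_t modules_initialized;            /* models the initialized-modules list */
    toy_list_t bml_modules;                    /* models the BML's module array */
    toy_list_t proc_btl_send[TOY_MAX_PROCS];   /* one send list per proc */
    size_t n_procs;
} toy_state_t;

static int toy_list_append(toy_list_t *l, const toy_module_t *m)
{
    if (l->count >= TOY_MAX_MODULES) return -1;
    l->items[l->count++] = m;
    return 0;
}

static int toy_list_contains(const toy_list_t *l, const toy_module_t *m)
{
    for (size_t i = 0; i < l->count; i++)
        if (l->items[i] == m) return 1;
    return 0;
}

/* Example component init: creates a single module. */
size_t toy_tcp_init(toy_module_t *out, size_t max)
{
    if (max < 1) return 0;
    out[0].component_name = "tcp";
    out[0].exclusivity = 100;
    return 1;
}

/* Returns the number of modules added, or -1 on error. */
int toy_add_btl(toy_state_t *st, const char *name)
{
    /* Step 1: find the component among the opened components. */
    const toy_component_t *comp = NULL;
    for (size_t i = 0; i < st->n_components; i++) {
        if (strcmp(st->components_opened[i].name, name) == 0) {
            comp = &st->components_opened[i];
            break;
        }
    }
    if (comp == NULL) return -1;

    /* Step 2: ask the component to create its modules. */
    size_t first = st->n_storage;
    size_t n = comp->init(&st->storage[first], TOY_MAX_MODULES - first);
    st->n_storage += n;

    /* Step 3: record the new modules as initialized. */
    for (size_t i = first; i < st->n_storage; i++)
        if (toy_list_append(&st->modules_initialized, &st->storage[i]) != 0)
            return -1;

    /* Steps 4 and 5: walk the initialized list; every module the BML does
     * not know yet is added to the BML and to each proc's send list. */
    for (size_t i = 0; i < st->modules_initialized.count; i++) {
        const toy_module_t *m = st->modules_initialized.items[i];
        if (toy_list_contains(&st->bml_modules, m)) continue;
        if (toy_list_append(&st->bml_modules, m) != 0) return -1;
        for (size_t p = 0; p < st->n_procs; p++)
            if (toy_list_append(&st->proc_btl_send[p], m) != 0) return -1;
    }
    return (int)n;
}
```

A caller would zero-initialize a `toy_state_t`, point `components_opened` at an array such as `{ { "tcp", toy_tcp_init } }`, set `n_procs`, and call `toy_add_btl(&st, "tcp")`. In the real code base, step 5 would presumably also have to respect exclusivity and reachability rather than appending unconditionally.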

> The Background:
> I should give some background on why I'm implementing this. Changing the
> MPI communication from a high speed network to a network with
> flow control (openib->tcp) is necessary for checkpointing distributed
> applications in virtual machines. Ok, you are able to checkpoint through
> the FT-Framework and BLCR in Open MPI, but virtual machines already
> provide trivial functions for checkpointing. As you are not able to
> checkpoint the hardware information of e.g. openib you have to get rid
> of it in case of a checkpoint, and change back again on resume/continue.

I'm not quite sure I understand. I can see how the original model of CRS and SNAPC doesn't quite fit that of VMs, but I don't quite understand what switching openib -> tcp and then later tcp -> openib gives you...?

Can't you just quiesce the openib BTL, let the VM checkpoint, and then resume with openib? (or whatever other non-TCP/sm BTL you want)

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/