Let me throw in a few clarifications on top of Adrian's information...
On Jan 9, 2008, at 9:24 AM, Adrian Knoth wrote:
> On Tue, Jan 08, 2008 at 10:51:45PM -0800, Muhammad Atif wrote:
>> I am planning to port tcp component to xensocket, which is a fast
>> interdomain communication mechanism for guest domains in Xen. I may
> Just to get things right: You first partition your SMP/Multicore
> with Xen, and then want to re-combine it later for MPI communication?
> Wouldn't it be easier to leave the unpartitioned host alone and use
> shared memory communication instead?
>> As per design, and the fact that these sockets are not normal
>> I have to pass certain information (basically memory references,
>> domain info etc) to other peers once sockets have been created. I
> There's ORTE, the runtime environment. It employs OOB/tcp to have a so
> called out-of-band channel. ORTE also provides a general purpose
> registry (GPR).
> Once a TCP connection between the headnode process and all other peers
> is established, you can store your required information in the GPR.
mpirun automatically sets up TCP connections between the launched
MPI processes and itself. This "out of band" TCP channel is used for
several things, such as exchanging meta information during startup
(e.g., the modex information).
>> understand that mca_pml_base_modex_send and recv (or simply using
>> mca_btl_tcp_component_exchange) can be used to exchange information,
> Use mca_pml_base_modex_send (now ompi_modex_send) and encode your
> required information. It gets stored in the GPR. Read it back with
> mca_pml_base_modex_recv (ompi_modex_recv), as it is done in
> mca_btl_tcp_component_exchange and mca_btl_tcp_proc_create.
Note that there is a very specific model for the modex: each BTL
module can do only a single send, and it can later do a "receive" to
get the information from its peer BTL modules in other MPI processes.
The modex information is not actually sent around during the modex
send/receive function calls -- modex_send just buffers data for a
gather operation that occurs later (all the data is gathered to
mpirun). Later during MPI_INIT, mpirun scatters all the information
to the MPI processes. Eventually, the BTL modules call modex_recv to
get their peer BTL information, but this is just a memcpy from the
data that was already received in the scatter from mpirun during
MPI_INIT.
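To make the ordering concrete, here is a toy, single-process sketch of
that model. It is not Open MPI code: every name in it (toy_modex_send,
toy_mpirun_exchange, toy_modex_recv, the fixed NPROCS and MAX_BLOB) is
made up for illustration, and the "processes" are just array slots.
It only tries to show that send buffers locally, the gather/scatter
happens in one later step, and recv is a local copy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NPROCS   3   /* hypothetical number of MPI processes */
#define MAX_BLOB 64  /* hypothetical cap on one modex payload */

/* One opaque blob published by one process. */
typedef struct {
    unsigned char data[MAX_BLOB];
    size_t len;
    int valid;
} toy_blob_t;

/* Per-process state: the locally staged blob plus the table that
 * arrives from "mpirun" in the scatter step. */
typedef struct {
    toy_blob_t staged;
    toy_blob_t peers[NPROCS];
    int have_table;
} toy_proc_t;

/* "send": only buffers the data locally; a second send is rejected. */
static int toy_modex_send(toy_proc_t *p, const void *buf, size_t len)
{
    if (p->staged.valid || len > MAX_BLOB) return -1;
    memcpy(p->staged.data, buf, len);
    p->staged.len = len;
    p->staged.valid = 1;
    return 0;
}

/* Stand-in for mpirun: gather every staged blob, then scatter the
 * complete table back to every process. */
static void toy_mpirun_exchange(toy_proc_t procs[NPROCS])
{
    toy_blob_t table[NPROCS];
    for (int r = 0; r < NPROCS; r++) table[r] = procs[r].staged;
    for (int r = 0; r < NPROCS; r++) {
        memcpy(procs[r].peers, table, sizeof table);
        procs[r].have_table = 1;
    }
}

/* "recv": no communication, just a copy out of the scattered table. */
static int toy_modex_recv(const toy_proc_t *p, int peer,
                          void *buf, size_t cap, size_t *len)
{
    if (!p->have_table || peer < 0 || peer >= NPROCS) return -1;
    const toy_blob_t *b = &p->peers[peer];
    if (!b->valid || b->len > cap) return -1;
    memcpy(buf, b->data, b->len);
    *len = b->len;
    return 0;
}

/* Runs the whole sequence once; returns 0 if it behaved as described. */
int toy_modex_demo(void)
{
    static toy_proc_t procs[NPROCS];
    char msg[MAX_BLOB], got[MAX_BLOB];
    size_t len = 0;

    memset(procs, 0, sizeof procs);
    for (int r = 0; r < NPROCS; r++) {
        int n = snprintf(msg, sizeof msg, "endpoint-of-rank-%d", r);
        if (toy_modex_send(&procs[r], msg, (size_t)n + 1) != 0) return -1;
    }
    /* A second send from the same process is refused. */
    if (toy_modex_send(&procs[0], "again", 6) == 0) return -1;
    /* Nothing can be received before the gather/scatter has happened. */
    if (toy_modex_recv(&procs[0], 1, got, sizeof got, &len) == 0) return -1;

    toy_mpirun_exchange(procs);

    if (toy_modex_recv(&procs[0], 2, got, sizeof got, &len) != 0) return -1;
    assert(len == strlen(got) + 1);
    return strcmp(got, "endpoint-of-rank-2") == 0 ? 0 : -1;
}
```

Calling toy_modex_demo() from a small main() is enough to try it out.
The practical consequence for a new BTL is that everything a peer
needs has to fit in the one blob you hand to modex_send.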
>> but I cannot seem to get them to communicate. So to put my question
>> in a very simple way: I want to create a socket structure containing
>> necessary information, and then pass it to all other peers before
>> start of actual mpi communication. What is the easiest way to do it?
> Quite the same way. mca_btl_tcp_component_exchange assembles the
> required information and stores it in the GPR by calling
> mca_pml_base_modex_send (ompi_modex_send).
> mca_btl_tcp_proc_create (think of "the other peers") reads this
> information into local context.
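As a rough illustration of "encode your required information", here
is a sketch of flattening a small endpoint descriptor into a byte
buffer that could be handed to the modex send call and decoded again
on the receiving side. The struct and its fields (domain_id,
grant_ref, buf_size) are purely hypothetical; I don't know what
xensocket actually needs, so substitute the real fields. The fixed
big-endian layout is just one reasonable choice so that peers agree
on the format regardless of struct padding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-endpoint information a Xen-based BTL might publish. */
typedef struct {
    uint16_t domain_id;  /* assumed: ID of the guest domain */
    uint32_t grant_ref;  /* assumed: reference to a shared memory page */
    uint32_t buf_size;   /* assumed: size of the shared buffer in bytes */
} xen_ep_info_t;

#define XEN_EP_WIRE_LEN 10  /* 2 + 4 + 4 bytes, no padding */

static void put_be16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xff);
}

static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)((v >> 16) & 0xff);
    p[2] = (unsigned char)((v >> 8) & 0xff);
    p[3] = (unsigned char)(v & 0xff);
}

static uint16_t get_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Encode into out[]; returns bytes written, or 0 if out is too small. */
size_t xen_ep_pack(const xen_ep_info_t *in, unsigned char *out, size_t cap)
{
    if (cap < XEN_EP_WIRE_LEN) return 0;
    put_be16(out, in->domain_id);
    put_be32(out + 2, in->grant_ref);
    put_be32(out + 6, in->buf_size);
    return XEN_EP_WIRE_LEN;
}

/* Decode from in[]; returns 0 on success, -1 if the blob is too short. */
int xen_ep_unpack(const unsigned char *in, size_t len, xen_ep_info_t *out)
{
    if (len < XEN_EP_WIRE_LEN) return -1;
    out->domain_id = get_be16(in);
    out->grant_ref = get_be32(in + 2);
    out->buf_size  = get_be32(in + 6);
    return 0;
}
```

The packed buffer is what you would pass to the modex send, and the
unpack step would run on whatever the modex receive returns for each
peer.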
> I guess you might want to copy btl/tcp to let's say btl/xen, so you
> modify internal structures, if required. Perhaps xensockets don't need
> IP addresses, as they are actually memory sockets.
> However, you'll still need TCP communication between Xen guests for
> OOB channel.
> As mentioned above, I'm not sure if it's reasonable to use Xen and MPI
> at all. Virtualization overhead might decrease your performance, and
> that's usually the last thing you want to have when using MPI ;)
> Cluster and Metacomputing Working Group
> Friedrich-Schiller-Universität Jena, Germany
> private: http://adi.thur.de