Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] btl tcp port to xensocket
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-01-17 09:42:41


On Jan 15, 2008, at 6:07 PM, Muhammad Atif wrote:

> Just for reference, I am trying to port btl/tcp to xensockets. Now
> if I want to do a modex send/recv, to my understanding,
> mca_btl_tcp_addr_t is used (the reference code/function is
> mca_btl_tcp_component_exchange). For xensockets, I need to send only
> one additional integer, remote_domU_id, across to, say, all the
> peers (in refined code it would be specific to each domain, but I
> just want to have a clear understanding before I move any further).
> I have changed the struct mca_btl_tcp_addr_t in btl_tcp_addr.h and
> added int r_domu_id. This makes the size of the structure 12. Upon
> receive, mca_btl_tcp_proc_create() gives an error after
> mca_pml_base_modex_recv(), at the statement if(0 != (size %
> sizeof(mca_btl_tcp_addr_t))), because the sizes do not match. It is
> still expecting size 8, whereas I have made the size 12. I am unable
> to pinpoint the exact location where the size 8 is still embedded.
> Any ideas?

Just to be clear -- you have copied the tcp btl to another new name
and are modifying that, right? E.g., ompi/mca/btl/xensocket?

If so, you need to change the prefix of all the symbols to
btl_xensocket, and be sure to change the string name of your component
in the component structure. The modex indexes off this string name,
so it's important that it doesn't share a name with any other
component in the framework.
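
For what it's worth, here is a small standalone illustration of where
the 8 and the 12 come from (this is not OMPI code; the field names are
just stand-ins chosen to reproduce the sizes you mention). Anything
that is still compiled against -- or still publishing under -- the old
8-byte definition will keep producing 8-byte entries, which is exactly
what the size check then rejects:

    #include <stdio.h>
    #include <stdint.h>

    /* Stand-ins for the two layouts of mca_btl_tcp_addr_t in play.
     * The real definition lives in btl_tcp_addr.h; these only mimic
     * the sizes from your mail (8 bytes before the change, 12 after). */
    struct addr_v1 {             /* original layout */
        uint32_t addr_inet;      /* 4 bytes */
        uint16_t addr_port;      /* 2 bytes */
        uint16_t addr_inuse;     /* 2 bytes */
    };

    struct addr_v2 {             /* modified layout */
        uint32_t addr_inet;
        uint16_t addr_port;
        uint16_t addr_inuse;
        int32_t  r_domu_id;      /* the extra integer */
    };

    int main(void)
    {
        /* Prints 8 and 12 on typical platforms.  A peer (or another
         * component publishing under the same modex name) built with
         * addr_v1 produces 8-byte entries, so a receiver that checks
         * size % sizeof(addr_v2) sees a mismatch. */
        printf("old sizeof = %zu, new sizeof = %zu\n",
               sizeof(struct addr_v1), sizeof(struct addr_v2));
        return 0;
    }

Either an object file that was not rebuilt after the header change or
another component publishing under the same string name will produce
that symptom.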

> Second question is regarding the receive part of Open MPI. In my
> understanding, once the Recv API is called, control goes through the
> PML layer and everything is initialized there. However, I am unable
> to locate the layer/file/function where the receive socket polling
> is done. There are callbacks, but where or how exactly does Open MPI
> know that a message has in fact arrived? Any pointer will do :)

Which receive are you asking about here -- BTL receive or the modex
receive?

>
>
> Best Regards,
> Muhammad Atif
> PS: Sorry if my questions are too basic.
>
> ----- Original Message ----
> From: Jeff Squyres <jsquyres_at_[hidden]>
> To: Open MPI Developers <devel_at_[hidden]>
> Sent: Friday, January 11, 2008 1:02:31 PM
> Subject: Re: [OMPI devel] btl tcp port to xensocket
>
>
> On Jan 10, 2008, at 8:40 PM, Muhammad Atif wrote:
>
> > Hi,
> > Thanks for such a detailed reply. You are right, we have
> > partitioned (normalized) our system with Xen and have seen that
> > the virtualization overhead is not that great (for some
> > applications) compared to the potential benefits that we can get.
> > We have executed various benchmarks on different network/cluster
> > configurations of Xen and native Linux and they are really
> > encouraging. The only known problem is inter-domain communication
> > in Xen, which is quite poor (1/6 of the native memory transfer
> > rate, not to mention 50% CPU utilization of the host). We have
> > tested out Xensocket, and these sockets give us a really good
> > performance boost in all respects.
> > Now that I am having a look at the complex yet wonderful
> > architecture of Open MPI, can you guys give me some guidance on a
> > couple of naive questions?
> >
> > 1- How do I view the console output of an MPI process which is not
> > on the headnode? Do I have to have some parallel debugger? Or is
> > there any magical technique?
>
> OMPI's run-time environment takes care of redirecting stdout/stderr
> from each MPI process to the stdout/stderr of mpirun for you (this is
> another use of the "out of band" TCP channel that is set up between
> mpirun and all the MPI processes).
>
> >
> > 2- How do I set up the GPR?
>
> You don't. The GPR is automatically instantiated in mpirun upon
> startup.
>
> > Say I have a struct foo, and all processes have at least one
> > instance of foo. From what I gather, Open MPI will create a linked
> > list of foo's that were passed on (though I am unable to pass one
> > on). Where do I have to define struct foo so that it can be
> > exchanged between the processes? I know it's a lame question, but I
> > think I am getting lost in the sea. :(
>
> I assume you're asking about the modex.
>
> Every BTL defines its own data that is passed around in the modex. It
> is assumed that only modules of the same BTL type will be able to
> read/understand that data. The modex just treats the data as a blob;
> all the modex blobs are gathered into mpirun and then broadcast out
> to every MPI process (I said scatter in my previous mail; broadcast
> is more accurate).
>
> So when you modex_send, you just pass a pointer to a chunk of memory
> and a length (e.g., a pointer to a struct instance and a length).
> When you modex_receive, you can just dereference the blob that you
> get back as the same struct type as you modex_send'ed previously
> (because you can only receive blobs from BTL modules that are the same
> type as you, and therefore the data they sent is the same type of data
> that you sent).
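
To make that concrete, here is a rough (untested) sketch of the
pattern for a fixed-size struct. The struct, the two helper functions,
and the mca_btl_xensocket_component global (the copied component,
assuming you named it that) are hypothetical; the modex calls are the
ones named elsewhere in this thread, but I'm writing their signatures
from memory, so check ompi/mca/pml/base/pml_base_module_exchange.h
before copying anything:

    /* Hypothetical sketch only: publish one fixed-size struct per
     * process and read it back per peer.  Follows the usage in
     * mca_btl_tcp_component_exchange / mca_btl_tcp_proc_create. */

    struct my_modex_info_t {
        int remote_domu_id;            /* whatever the peers need */
    };

    /* publish: called once from the component's exchange step */
    static int my_component_exchange(int local_domu_id)
    {
        struct my_modex_info_t info;
        info.remote_domu_id = local_domu_id;

        /* the component pointer keys the data; use YOUR component,
         * not the tcp one, or the blobs will collide */
        return mca_pml_base_modex_send(
            &mca_btl_xensocket_component.super.btl_version,
            &info, sizeof(info));
    }

    /* consume: called per peer, e.g. from your proc_create equivalent */
    static int my_read_peer_info(ompi_proc_t *peer, int *domu_id)
    {
        struct my_modex_info_t *info;
        size_t size;
        int rc = mca_pml_base_modex_recv(
            &mca_btl_xensocket_component.super.btl_version,
            peer, (void **)&info, &size);
        if (OMPI_SUCCESS != rc || sizeof(*info) != size) {
            return OMPI_ERROR;
        }
        /* same struct type as the sender used, so just dereference */
        *domu_id = info->remote_domu_id;
        return OMPI_SUCCESS;
    }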
>
> You can do more complex things in the modex if you need to, of
> course. For example, we're changing the openib BTL to send variable
> length data in the modex, but that requires a bit more setup and I
> suspect you don't need to do this.
>
> >
> > Best Regards,
> > Muhammad Atif
> > PS: I am totally new to MPI internals. So if at all we decide to go
> > ahead with the project, I would be a regular bugger on the list.
>
> That's what we're here for. We don't always reply immediately, but we
> try. :-)
>
> >
> > ----- Original Message ----
> > From: Adrian Knoth <adi_at_[hidden]>
> > To: Open MPI Developers <devel_at_[hidden]>
> > Sent: Thursday, January 10, 2008 1:24:01 AM
> > Subject: Re: [OMPI devel] btl tcp port to xensocket
> >
> > On Tue, Jan 08, 2008 at 10:51:45PM -0800, Muhammad Atif wrote:
> >
> > > I am planning to port tcp component to xensocket, which is a fast
> > > interdomain communication mechanism for guest domains in Xen. I
> > > may
> >
> > Just to get things right: You first partition your SMP/Multicore
> > system with Xen, and then want to re-combine it later for MPI
> > communication?
> >
> > Wouldn't it be easier to leave the unpartitioned host alone and use
> > shared memory communication instead?
> >
> > > As per design, and the fact that these sockets are not normal
> > > sockets, I have to pass certain information (basically memory
> > > references, guest domain info etc) to other peers once sockets
> > > have been created. I
> >
> > There's ORTE, the runtime environment. It employs OOB/tcp to have
> > a so-called out-of-band channel. ORTE also provides a general
> > purpose registry (GPR).
> >
> > Once a TCP connection between the headnode process and all other
> > peers is established, you can store your required information in
> > the GPR.
> >
> > > understand that mca_pml_base_modex_send and recv (or simply using
> > > mca_btl_tcp_component_exchange) can be used to exchange
> > > information,
> >
> > Use mca_pml_base_modex_send (now ompi_modex_send) and encode your
> > required information. It gets stored in the GPR. Read it back with
> > mca_pml_base_modex_recv (ompi_modex_recv), as is done in
> > mca_btl_tcp_component_exchange and mca_btl_tcp_proc_create.
> >
> > > but I cannot seem to get them to communicate. So to put my
> > > question in a very simple way... I want to create a socket
> > > structure containing the necessary information, and then pass it
> > > to all other peers before the start of actual MPI communication.
> > > What is the easiest way to do it?
> >
> >
> > Quite the same way. mca_btl_tcp_component_exchange assembles the
> > required information and stores it in the GPR by calling
> > ompi_modex_send.
> >
> > mca_btl_tcp_proc_create (think of "the other peers") reads this
> > information into local context.
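
Spelled out, the receive side Adrian is pointing at looks roughly like
this -- a condensed, from-memory paraphrase of what
mca_btl_tcp_proc_create does, not a copy of it; the function name,
variable names and error handling are illustrative only:

    /* Condensed paraphrase, for orientation only. */
    static int sketch_read_peer_addrs(ompi_proc_t *peer,
                                      mca_btl_tcp_addr_t **addrs_out,
                                      size_t *count_out)
    {
        mca_btl_tcp_addr_t *addrs;
        size_t size;
        int rc = mca_pml_base_modex_recv(
            &mca_btl_tcp_component.super.btl_version,
            peer, (void **)&addrs, &size);
        if (OMPI_SUCCESS != rc) {
            return OMPI_ERROR;
        }
        /* the blob must be a whole number of address entries; this is
         * the modulo test that fails when the two sides disagree on
         * sizeof(mca_btl_tcp_addr_t) */
        if (0 != (size % sizeof(mca_btl_tcp_addr_t))) {
            return OMPI_ERROR;
        }
        *addrs_out = addrs;   /* one entry per exported address */
        *count_out = size / sizeof(mca_btl_tcp_addr_t);
        return OMPI_SUCCESS;
    }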
> >
> >
> > I guess you might want to copy btl/tcp to let's say btl/xen, so you
> > can modify internal structures, if required. Perhaps xensockets
> > don't need IP addresses, as they are actually memory sockets.
> >
> > However, you'll still need TCP communication between Xen guests for
> > the OOB channel.
> >
> >
> > As mentioned above, I'm not sure if it's reasonable to use Xen and
> > MPI at all. Virtualization overhead might decrease your performance,
> > and that's usually the last thing you want to have when using MPI ;)
> >
> >
> > HTH
> >
> > --
> > Cluster and Metacomputing Working Group
> > Friedrich-Schiller-Universität Jena, Germany
> >
> > private: http://adi.thur.de
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>

-- 
Jeff Squyres
Cisco Systems