
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Shared memory optimizations in OMPI
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-11-22 10:46:58


All the shared memory code is in the "sm" BTL (byte transfer layer) component: ompi/mca/btl/sm. All the TCP MPI code is in the "tcp" BTL component: ompi/mca/btl/tcp. Think of "ob1" (the PML component in ompi/mca/pml/ob1) as the MPI engine at the bottom of MPI_SEND, MPI_RECV, and friends. It takes a message to be sent, determines how many BTLs can be used to send it, fragments the message as appropriate, and chooses one of several protocols to actually send the message. It then hands the fragments of that message off to the underlying BTLs, which perform the actual transfer.
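To make the fragmentation step concrete, here is a minimal C sketch. It is not Open MPI code: fragment_t, fragment_message(), choose_protocol(), and the simple eager/rendezvous split are hypothetical stand-ins for the real ob1 logic, and the size limits are assumed parameters rather than actual Open MPI defaults.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fragment descriptor: an offset/length view into the user buffer. */
typedef struct {
    size_t offset;
    size_t length;
} fragment_t;

/* Hypothetical protocol choice: small messages go eagerly, large ones
 * use a rendezvous (handshake first, then bulk data). */
typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

protocol_t choose_protocol(size_t msg_len, size_t eager_limit)
{
    return msg_len <= eager_limit ? PROTO_EAGER : PROTO_RENDEZVOUS;
}

/*
 * Split a message of msg_len bytes into fragments of at most max_frag bytes
 * (an assumed per-transport maximum). Writes at most max_out descriptors
 * into out and returns the total number of fragments the message needs.
 */
size_t fragment_message(size_t msg_len, size_t max_frag,
                        fragment_t *out, size_t max_out)
{
    assert(max_frag > 0);
    size_t count = 0;
    for (size_t off = 0; off < msg_len; off += max_frag) {
        /* The last fragment may be shorter than max_frag. */
        size_t len = msg_len - off < max_frag ? msg_len - off : max_frag;
        if (count < max_out) {
            out[count].offset = off;
            out[count].length = len;
        }
        count++;
    }
    return count;
}
```

For example, a 10000-byte message with a 4096-byte fragment limit yields three fragments (4096 + 4096 + 1808 bytes). Returning the total count even when `out` is too small lets a caller size its array first.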

So ob1 has no direct knowledge of shared memory or TCP -- it relies on the BTLs to say "yes, I can reach peer X at priority Y". For example, both TCP and sm will respond that they can reach a peer that is on the same server node. But sm has a higher priority, so it gets all the fragments destined for that process, and TCP is ignored for that peer.
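The reachability-plus-priority idea can be sketched in C as follows. Everything here is hypothetical (btl_t, select_btl(), and the priority values are illustrative, not Open MPI's actual structures): each transport answers whether it can reach a peer, and the reachable one with the highest priority wins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a BTL component: a name, a priority,
 * and a callback answering "can I reach this peer?". */
typedef struct {
    const char *name;
    int priority;                                   /* higher wins */
    bool (*can_reach)(int my_node, int peer_node);
} btl_t;

/* Shared memory can only reach peers on the same node. */
static bool sm_can_reach(int my_node, int peer_node)
{
    return my_node == peer_node;
}

/* TCP can reach any peer, local or remote. */
static bool tcp_can_reach(int my_node, int peer_node)
{
    (void)my_node;
    (void)peer_node;
    return true;
}

/* Return the reachable BTL with the highest priority, or NULL if none. */
const btl_t *select_btl(const btl_t *btls, size_t n, int my_node, int peer_node)
{
    assert(n == 0 || btls != NULL);
    const btl_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!btls[i].can_reach(my_node, peer_node))
            continue;                               /* cannot reach: skip */
        if (best == NULL || btls[i].priority > best->priority)
            best = &btls[i];
    }
    return best;
}

/* Example table: sm outranks tcp (priority values are hypothetical). */
const btl_t example_btls[] = {
    { "tcp", 10, tcp_can_reach },
    { "sm", 100, sm_can_reach },
};
```

With the example table, a peer on the same node (node 0 talking to node 0) selects "sm", while a peer on another node can only be reached by, and therefore selects, "tcp".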

Remember: all of this is set up during MPI_INIT. During MPI_SEND (and friends), ob1 (together with r2, the BML, or BTL multiplexing layer) is just looking up arrays of pointers and invoking function pointers that were set up earlier.
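A minimal sketch of that split between setup and fast path, again with hypothetical names (send_fn, init_dispatch(), do_send(), and the two stub transports are illustrative, not Open MPI's API): all decisions happen once at init time, and each send is just an array lookup plus an indirect call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-transport send function signature. */
typedef int (*send_fn)(int peer, const void *buf, size_t len);

/* Counters so that the dispatch decision is observable. */
static int sm_calls, tcp_calls;

/* Stub transports: they only count how often they were invoked. */
static int sm_send(int peer, const void *buf, size_t len)
{
    (void)peer; (void)buf; (void)len;
    sm_calls++;
    return 0;
}

static int tcp_send(int peer, const void *buf, size_t len)
{
    (void)peer; (void)buf; (void)len;
    tcp_calls++;
    return 0;
}

#define MAX_PEERS 8

/* Filled once at "init" time: one function pointer per peer. */
static send_fn peer_send[MAX_PEERS];

/* Init-time setup: peers on my node get the shared-memory path, others TCP. */
void init_dispatch(const int *peer_node, size_t npeers, int my_node)
{
    assert(npeers <= MAX_PEERS);
    for (size_t p = 0; p < npeers; p++)
        peer_send[p] = (peer_node[p] == my_node) ? sm_send : tcp_send;
}

/* Send-time fast path: no decisions, just a lookup and an indirect call. */
int do_send(int peer, const void *buf, size_t len)
{
    return peer_send[peer](peer, buf, len);
}
```

This is also why stepping through ob1 in a debugger lands on indirect calls: the interesting choice was made earlier, when the table was filled in.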

So you can look into ob1, but be aware that it's all done by function pointers and indirection.

Your best bet might well be to look at individual function names in the TCP and SM BTLs and set breakpoints on those. The file ompi/mca/btl/btl.h describes what each of the publicly exported functions of a BTL component does; this will tell you what the corresponding functions in the TCP and SM BTLs are doing.

On Nov 22, 2011, at 10:12 AM, Shamik Ganguly wrote:

> Thanks a lot Jeff.
>
> PIN is a dynamic binary instrumentation tool from Intel. It runs on top of the binary on the MPI node. When it's given function calls to instrument, it inserts hooks before/after those function calls in the binary of the program you are instrumenting, and you can insert your own functions there.
>
> I am doing some memory address profiling on benchmarks running on MPI, and I was using PIN to get the load/store addresses. Furthermore, I needed to know which LD/STs were coming from actual socket communication and which were coming from shared memory optimizations. So I needed to know which functions make that decision, and where exactly, so that I can instrument the appropriate MPI library function call (the actual low-level function, not the API like MPI_Sends/Recvs) in PIN. Hence I guess I am actually zooming down to a 1000 ft view :)
>
> Any suggestion is welcome. I will go into the ob1 directory and hunt around to see how exactly it's being done.
>
> Regards,
> Shamik
>
> On Tue, Nov 22, 2011 at 10:08 AM, Shamik Ganguly <shamik.ganguly_at_[hidden]> wrote:
>
> On Tue, Nov 22, 2011 at 9:04 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On Nov 22, 2011, at 1:09 AM, Shamik Ganguly wrote:
>
> > I want to trace when the MPI library prevents an MPI_Send from going to the socket and makes it access shared memory because the target node is on the same chip (CMP). I want to use PIN to trace this. Can you please give me some pointers about which functions are taking this decision so that I can instrument the appropriate library calls in PIN?
>
> What's PIN?
>
> The decision is made in the ob1 PML plugin. Way back during MPI_INIT, each MPI process creates lists of BTLs to use to contact each MPI process peer.
>
> When a process is on the same *node* (e.g., a single server) -- not just the same processor socket/chip -- the shared memory BTL is given preference over all other BTLs by means of a priority mechanism. Hence, the "sm" BTL is put at the front of the BML lists (BML = BTL multiplexing layer -- it's essentially just list management for BTLs).
>
> Later, when MPI_SEND comes through, it uses the already-setup BML lists to determine which BTL(s) to use to send a message.
>
> That's the 50,000 foot view.
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Shamik Ganguly
>
>
>
>
> --
> Shamik Ganguly
> 2nd year, MS (CSE-Hardware), University of Michigan, Ann Arbor
> B.Tech.(E&ECE), IITKGP (2008)
