Thanks, that was actually a lot of help. I had very little understanding of
the --bynode and --byslot options. Thanks!
On 6/5/08, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> On May 23, 2008, at 9:07 PM, Cally K wrote:
> > Hi, I have a question about --bynode and --byslot that I would like
> > to clarify.
> > Say, for example, I have a hostfile
> > #Hostfile
> > __________________________
> > node0
> > node1 slots=2 max_slots=2
> > node2 slots=2 max_slots=2
> > node3 slots=4 max_slots=4
> > ___________________________
> > There are 4 nodes and 9 slots. How should I run mpirun? For now I use
> > a) mpirun -np --bynode 4 ./abcd
> I assume you mean "... -np 4 --bynode ..."
> > I know that the slot thingy is for SMPs, and I have tried running
> > mpirun -np --byslot 9 ./abcd
> > and I noticed that it takes longer with --byslot than with
> > --bynode.
> According to your text, you're running 9 processes when using --byslot
> and 4 when using --bynode. Is that a typo? I'll assume that it is --
> that you meant to use 9 in both cases.
> > and I just read the FAQ, which says that the byslot option is used by
> > default, so I don't have to specify it, right?
> I'm not sure what your question is. The actual performance may depend
> on your application and what its communication and computation
> patterns are. It gets more difficult to model when you have a
> heterogeneous setup (as you appear to have, per your hostfile).
> Let's take your example of 9 processes.
> - With --bynode, the MPI_COMM_WORLD ranks will be laid out as follows
> (MCWR = "MPI_COMM_WORLD rank"):
> node0: MCWR 0
> node1: MCWR 1, MCWR 4
> node2: MCWR 2, MCWR 5
> node3: MCWR 3, MCWR 6, MCWR 7, MCWR 8
> - With --byslot, it'll look like this:
> node0: MCWR 0
> node1: MCWR 1, MCWR 2
> node2: MCWR 3, MCWR 4
> node3: MCWR 5, MCWR 6, MCWR 7, MCWR 8
> In short, OMPI is doing round-robin placement of your processes; the
> only difference is in which dimension is traversed first: by node or
> by slot.
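
The two layouts above can be reproduced with a small simulation. This is
only a sketch of the round-robin idea, not Open MPI's actual mapping code:
the names place_ranks and HOSTS are made up for illustration, node0 is
assumed to have 1 slot (it lists none in the hostfile, and the thread counts
9 slots in total), and the sketch simply refuses to oversubscribe rather
than modeling max_slots.

```python
# Hypothetical model of the hostfile from this thread: (hostname, slots).
# node0 lists no slot count; 1 slot is assumed here.
HOSTS = [("node0", 1), ("node1", 2), ("node2", 2), ("node3", 4)]


def place_ranks(hosts, nprocs, policy):
    """Return {hostname: [ranks]} for a simplified round-robin placement.

    policy "byslot": fill every slot on a node before moving to the next.
    policy "bynode": hand one rank to each node in turn, skipping full nodes.
    """
    if policy not in ("byslot", "bynode"):
        raise ValueError("policy must be 'byslot' or 'bynode'")
    if nprocs > sum(slots for _, slots in hosts):
        # Simplification: this sketch never oversubscribes a node.
        raise ValueError("more processes than slots")

    layout = {name: [] for name, _ in hosts}
    rank = 0
    if policy == "byslot":
        # Traverse the slot dimension first.
        for name, slots in hosts:
            while rank < nprocs and len(layout[name]) < slots:
                layout[name].append(rank)
                rank += 1
    else:
        # Traverse the node dimension first, one rank per node per pass.
        while rank < nprocs:
            for name, slots in hosts:
                if rank < nprocs and len(layout[name]) < slots:
                    layout[name].append(rank)
                    rank += 1
    return layout


if __name__ == "__main__":
    for policy in ("bynode", "byslot"):
        print(policy)
        for name, ranks in place_ranks(HOSTS, 9, policy).items():
            print("  %s: %s" % (name, ranks))
```

Running it with 9 processes prints one rank list per node for each policy,
which can be compared against the two tables above.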
> As to why there's such a performance difference, it could depend on a
> lot of things: the difference in computational speed and/or RAM on
> your 4 nodes, the changing communication patterns between the two
> (shared memory is usually used for on-node communication, which is
> usually faster than most networks), etc. It really depends on what
> your application is *doing*.
> Sorry I can't be of more help...
> Jeff Squyres
> Cisco Systems