Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] --bynode vs --byslot
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-04 13:07:46


On May 23, 2008, at 9:07 PM, Cally K wrote:

> Hi, I have a question about --bynode and --byslot that i would like
> to clarify
>
> Say, for example, I have a hostfile
>
> #Hostfile
>
> __________________________
> node0
> node1 slots=2 max_slots=2
> node2 slots=2 max_slots=2
> node3 slots=4 max_slots=4
> ___________________________
>
> There are 4 nodes and 9 slots, how do I run my mpirun, for now I use
>
> a) mpirun -np --bynode 4 ./abcd

I assume you mean "... -np 4 --bynode ..."

> I know that the slot thingy is for SMPs, and I have tried running
> mpirun -np --byslot 9 ./abcd
>
> and I noticed that its longer when I do --byslot when compared to
> --bynode

According to your text, you're running 9 processes when using --byslot
and 4 when using --bynode. Is that a typo? I'll assume that it is, and
that you meant to use 9 in both cases.

> and I just read the faq that said, by defauly the byslot option is
> used, so I dun have to use it rite,,,

I'm not sure what your question is. The actual performance may depend
on your application and what its communication and computation
patterns are. It gets more difficult to model when you have a
heterogeneous setup (which yours appears to be, per your hostfile).

Let's take your example of 9 processes.

- With --bynode, the MPI_COMM_WORLD ranks will be laid out as follows
(MCWR = "MPI_COMM_WORLD rank"):

node0: MCWR 0
node1: MCWR 1, MCWR 4
node2: MCWR 2, MCWR 5
node3: MCWR 3, MCWR 6, MCWR 7, MCWR 8

- With --byslot, it'll look like this:

node0: MCWR 0
node1: MCWR 1, MCWR 2
node2: MCWR 3, MCWR 4
node3: MCWR 5, MCWR 6, MCWR 7, MCWR 8

In short, OMPI is doing round-robin placement of your processes; the
only difference is in which dimension is traversed first: by node or
by slot.
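
To make the two traversal orders concrete, here is a small Python sketch
that reproduces both layouts for the hostfile above. It is only a
simplified model of round-robin placement, not OMPI's actual mapper: the
function name `place` and the `HOSTFILE_SLOTS` table are made up for
illustration, node0 is assumed to have 1 slot (it lists none, and the
total is 9), and oversubscription is not handled.

```python
# Simplified model of round-robin rank placement (illustration only;
# this is not Open MPI's mapping code).

# Slot counts taken from the hostfile; node0 is assumed to have 1 slot.
HOSTFILE_SLOTS = {"node0": 1, "node1": 2, "node2": 2, "node3": 4}


def place(slots, nprocs, policy):
    """Return {node: [ranks]} for a 'bynode' or 'byslot' traversal."""
    if nprocs > sum(slots.values()):
        raise ValueError("this sketch does not model oversubscription")
    layout = {node: [] for node in slots}
    rank = 0
    if policy == "byslot":
        # Fill every slot on a node before moving to the next node.
        for node, nslots in slots.items():
            for _ in range(nslots):
                if rank == nprocs:
                    break
                layout[node].append(rank)
                rank += 1
    elif policy == "bynode":
        # Give each node one rank per pass, skipping nodes that are full.
        while rank < nprocs:
            for node, nslots in slots.items():
                if rank < nprocs and len(layout[node]) < nslots:
                    layout[node].append(rank)
                    rank += 1
    else:
        raise ValueError("policy must be 'bynode' or 'byslot'")
    return layout


if __name__ == "__main__":
    for policy in ("bynode", "byslot"):
        print(policy, place(HOSTFILE_SLOTS, 9, policy))
```

With 9 processes, the "bynode" result puts ranks 3, 6, 7, 8 on node3 and
the "byslot" result puts ranks 5, 6, 7, 8 there, matching the tables
above.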

As to why there's such a performance difference, it could depend on a
lot of things: the difference in computational speed and/or RAM on
your 4 nodes, the changing communication patterns between the two
(shared memory is usually used for on-node communication, which is
usually faster than most networks), etc. It really depends on what
your application is *doing*.
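
As one illustration of how the layout changes communication, the sketch
below counts how many neighbor pairs end up on the same node if each
rank talks to rank+1 in a ring. The ring pattern is purely hypothetical
(I don't know what your application does), and the function name
`on_node_ring_pairs` is made up; the two layouts are copied from the
tables above.

```python
# Count ring-neighbor pairs (rank r and rank r+1, wrapping around) that
# share a node. The ring pattern is an assumed example, not a statement
# about any particular application.

BYNODE = {"node0": [0], "node1": [1, 4], "node2": [2, 5], "node3": [3, 6, 7, 8]}
BYSLOT = {"node0": [0], "node1": [1, 2], "node2": [3, 4], "node3": [5, 6, 7, 8]}


def on_node_ring_pairs(layout):
    """Return how many (r, r+1 mod n) pairs are placed on the same node."""
    node_of = {rank: node for node, ranks in layout.items() for rank in ranks}
    n = len(node_of)
    return sum(node_of[r] == node_of[(r + 1) % n] for r in range(n))


if __name__ == "__main__":
    print("bynode:", on_node_ring_pairs(BYNODE))  # 2 of 9 pairs on-node
    print("byslot:", on_node_ring_pairs(BYSLOT))  # 5 of 9 pairs on-node
```

For this particular pattern, --byslot keeps more neighbor traffic
on-node, but a different pattern (or unequal node speeds) could easily
tip the balance the other way.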

Sorry I can't be of more help...

-- 
Jeff Squyres
Cisco Systems