Open MPI User's Mailing List Archives

Subject: [OMPI users] reasons for jitter other than paffinity
From: Javier Fernández (javier_at_[hidden])
Date: 2010-06-01 04:21:42


Hello,

I was running the pi demo on multicore machines (a Pentium Dual and a Core
i7) to see how it scales, but sometimes the time measurements return
disparate results. The FAQ suggests processor affinity as one possible reason
for that. For instance, the demo takes about 3 s on one core of the Pentium
Dual:
> $ mpirun -h
> mpirun (Open MPI) 1.4.2
> ...
> $ mpirun -c 1 pi
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 1 2.963499 6.332712E-13

and, sometimes, approximately half that time on both cores:
> $ mpirun -c 2 pi
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.535394 2.291500E-13
> $ mpirun -c 2 pi
> ...
> 2 2.497539 2.291500E-13

Roughly 6 out of 10 runs take 2.5 s, and 4 out of 10 take 1.5 s.
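
For reference, the demo is essentially the classic midpoint-rule pi
computation: all the work is local, and the only communication is a single
exchange of the partial sums at the very end. A trimmed sketch of the
structure (the actual source differs in the details):

> /* pi.c -- simplified sketch of the demo, not the exact source */
> #include <stdio.h>
> #include <math.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     const long n = 100000000;              /* 1e+08 iterations */
>     int rank, size, src;
>     long i;
>     double h, x, sum = 0.0, mypi, pi, tmp, t0, t1;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     t0 = MPI_Wtime();
>     h = 1.0 / (double) n;
>     for (i = rank; i < n; i += size) {      /* purely local work */
>         x = h * ((double) i + 0.5);
>         sum += 4.0 / (1.0 + x * x);
>     }
>     mypi = h * sum;
>
>     /* the only communication: one send-recv of partial sums at the end */
>     if (rank != 0) {
>         MPI_Send(&mypi, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
>     } else {
>         pi = mypi;
>         for (src = 1; src < size; src++) {
>             MPI_Recv(&tmp, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
>                      MPI_STATUS_IGNORE);
>             pi += tmp;
>         }
>     }
>     t1 = MPI_Wtime();
>
>     if (rank == 0) {
>         printf("# Estimation of pi is %f after %g iterations\n",
>                pi, (double) n);
>         printf("# PCs Time (s) Error\n");
>         printf("%d %f %E\n", size, t1 - t0, fabs(pi - M_PI));
>     }
>     MPI_Finalize();
>     return 0;
> }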

So I followed the FAQ remarks about paffinity:
> $ mpirun -mca mpi_paffinity_alone 1 -c 2 pi
> ...
> 2 1.496346 2.291500E-13
> $ mpirun -mca mpi_paffinity_alone 1 -c 2 pi
> ...
> 2 2.527654 2.291500E-13

Now roughly 2 out of 10 runs take 2.5 s. I'm not sure that's the way it's
expected to work, getting the not-bound-to-core time once in every five tries.

I guess I'm doing something wrong. I tried to check whether the ranks are
actually being bound, and it seems they are:
> $ mpirun -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> [nodo0:03233] [[51755,0],0] odls:default:fork binding child [[51755,1],0] to cpus 0001
> [nodo0:03233] [[51755,0],0] odls:default:fork binding child [[51755,1],1] to cpus 0002
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.536590 2.291500E-13
> $ mpirun -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> [nodo0:03236] [[51758,0],0] odls:default:fork binding child [[51758,1],0] to cpus 0001
> [nodo0:03236] [[51758,0],0] odls:default:fork binding child [[51758,1],1] to cpus 0002
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 2.556353 2.291500E-13
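
Independently of -report-bindings, I suppose each rank could also print its
own mask from inside the code with sched_getaffinity() (Linux-specific; a
quick helper I could drop into the demo and call right after MPI_Comm_rank,
not something the demo currently does):

> /* hypothetical helper: print the CPUs this rank is allowed to run on */
> #define _GNU_SOURCE
> #include <sched.h>
> #include <stdio.h>
>
> static void print_my_mask(int rank)
> {
>     cpu_set_t mask;
>     int cpu;
>
>     if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
>         printf("rank %d allowed on cpus:", rank);
>         for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
>             if (CPU_ISSET(cpu, &mask))
>                 printf(" %d", cpu);
>         printf("\n");
>     }
> }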

Then I thought it might be related to ompi believing the node was being
oversubscribed and thus entering Degraded mode instead of Aggressive mode
(although this pi demo has only one send-recv at the end of the code, so it's
hard to believe a 1 s difference is due to that). The -nooversubscribe switch
is a cool way to quickly check that:
> $ mpirun -nooversubscribe -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
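
(I suppose Degraded mode could also be ruled out directly by forcing
Aggressive mode, along the lines of
> $ mpirun -mca mpi_yield_when_idle 0 -c 2 pi
assuming mpi_yield_when_idle is indeed the parameter that Degraded mode
toggles, but -nooversubscribe looked like the quickest way to see whether
ompi actually thinks the node is oversubscribed.)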

If I understand correctly, slots can only be specified in a hostfile, not via
mpirun switches. Is that true?
> $ cat hf
> n0 slots=2 max-slots=2
> $ mpirun -hostfile hf -mca mpi_paffinity_alone 1 -report-bindings -c 3 pi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
> ...
> cpd_at_nodo0:~/ompi$ mpirun -hostfile hf -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> [nodo0:03444] [[52222,0],0] odls:default:fork binding child [[52222,1],0] to cpus 0001
> [nodo0:03444] [[52222,0],0] odls:default:fork binding child [[52222,1],1] to cpus 0002
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.502448 2.291500E-13
> cpd_at_nodo0:~/ompi$ mpirun -hostfile hf -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> [nodo0:03447] [[52221,0],0] odls:default:fork binding child [[52221,1],0] to cpus 0001
> [nodo0:03447] [[52221,0],0] odls:default:fork binding child [[52221,1],1] to cpus 0002
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 2.540400 2.291500E-13

Now it sometimes takes 10-15 tries to get a 2.5 s run. That's much more
reliable, but I'm not sure I'm getting it right. Is this the expected way it
works? Is it normal to get the not-bound-to-core time once every 10-15 tries?
Am I missing something else?

I see there are also rankfiles, so I tried this
> $ cat rf
> rank 0=n0 slot=0:0
> rank 1=n0 slot=0:1
> cpd_at_nodo0:~/ompi$ mpirun -rf rf -report-bindings -c 2 pi
> [nodo0:03512] [[52018,0],0] odls:default:fork binding child [[52018,1],0] to slot_list 0:0
> [nodo0:03512] [[52018,0],0] odls:default:fork binding child [[52018,1],1] to slot_list 0:1
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 2.503292 2.291500E-13
> cpd_at_nodo0:~/ompi$ mpirun -rf rf -report-bindings -c 2 pi
> [nodo0:03515] [[52017,0],0] odls:default:fork binding child [[52017,1],0] to slot_list 0:0
> [nodo0:03515] [[52017,0],0] odls:default:fork binding child [[52017,1],1] to slot_list 0:1
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.546936 2.291500E-13

I got 2.5 s in 4 out of 10 runs, and about the same (well, 5 out of 10) if I
also include -hostfile hf, so it seems I am still missing something. Probably
that uncontrolled "something" is what the 2/10 vs 4/10 statistics hinge on.

And editing the files to point to n1, one of the Core i7 nodes...
> cpd_at_nodo0:~/ompi$ grep cores /proc/cpuinfo
> cpu cores : 2
> cpu cores : 2
> cpd_at_nodo0:~/ompi$ ssh n1 grep cores /proc/cpuinfo
> cpu cores : 4
> cpu cores : 4
> cpu cores : 4
> cpu cores : 4
> $ cat hf
> n1 slots=4 max-slots=4
> cpd_at_nodo0:~/ompi$ cat rf
> rank 0=n1 slot=0:0
> rank 1=n1 slot=0:1
> rank 2=n1 slot=0:2
> rank 3=n1 slot=0:3
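
(Side note: since the Core i7 can have Hyper-Threading enabled, maybe I
should also look at the full topology rather than just the core count, e.g.
> $ ssh n1 "grep -E 'processor|physical id|siblings|core id' /proc/cpuinfo"
to see whether logical CPUs 0-3 really map to four distinct cores. I have not
checked that here.)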

The 3 s sequential run becomes 1.66 s on this node:
> $ mpirun -H n1 pi
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 1 1.662411 6.332712E-13
> cpd_at_nodo0:~/ompi$ mpirun -H n1 -mca mpi_paffinity_alone 1 -report-bindings -c 1 pi
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 1 1.663015 6.332712E-13

Hmf, it didn't report the bindings since there are none, as I forgot to
mention the slots...

> cpd_at_nodo0:~/ompi$ mpirun -H n1 -display-devel-map -mca mpi_paffinity_alone 1 -report-bindings -c 1 pi
>
> Map generated by mapping policy: 0402
> Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
> ...
> Data for node: Name: n1 Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 1 Num cores/socket: 2
> ...
> Num slots: 1 Slots in use: 1
> Num slots allocated: 1 Max slots: 0
> ...
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 1 1.663060 6.332712E-13

...I forgot to use the hostfile to specify the slots, so ompi thinks I'm
oversubscribing a 1-slot node. It also thinks this is a dual-core node, like
the one I'm typing mpirun on. I could fix the core count from the command
line, but not the oversubscription (and, consequently, Degraded mode). With 2
ranks, 1.66 s becomes 1.84 s. Hmm, now that I think of it, that's a 1 s
difference from the expected 0.84 s...

> $ mpirun -H n1 -num-cores 4 -display-devel-map -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
>
> Map generated by mapping policy: 0402
> Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
> ...
> Data for node: Name: n1 Launch id: -1 Arch: 0 State: 2
> Num boards: 1 Num sockets/board: 1 Num cores/socket: 4
> ...
> Num slots: 1 Slots in use: 2
> Num slots allocated: 1 Max slots: 0
> ...
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.843984 2.291500E-13

It's indeed oversubscribed

> $ mpirun -H n1 -nooversubscribe -num-cores 4 -display-devel-map -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:

So here goes the hostfile
> cpd_at_nodo0:~/ompi$ cat hf
> n1 slots=4 max-slots=4
> cpd_at_nodo0:~/ompi$ mpirun -hostfile hf -nooversubscribe -num-cores 4 -display-devel-map -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
>
> Map generated by mapping policy: 0402
> Npernode: 0 Oversubscribe allowed: FALSE CPU Lists: FALSE
> ...
> Data for node: Name: n1 Launch id: -1 Arch: 0 State: 0
> Num boards: 1 Num sockets/board: 1 Num cores/socket: 4
> ...
> Num slots: 4 Slots in use: 2
> Num slots allocated: 4 Max slots: 4
> ...
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.845798 2.291500E-13

Ouch! Where are my bindings? This did work for node n0!!!
> cpd_at_nodo0:~/ompi$ cat hf1
> n1 slots=4 max-slots=4
> cpd_at_nodo0:~/ompi$ cat hf0
> n0 slots=2 max-slots=2
> cpd_at_nodo0:~/ompi$ mpirun -hostfile hf0 -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> [nodo0:03202] [[51720,0],0] odls:default:fork binding child [[51720,1],0] to cpus 0001
> [nodo0:03202] [[51720,0],0] odls:default:fork binding child [[51720,1],1] to cpus 0002
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.529327 2.291500E-13
> cpd_at_nodo0:~/ompi$ mpirun -hostfile hf1 -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.842951 2.291500E-13

Ok, I'm moving to node n1
> cpd_at_nodo0:~/ompi$ ssh n1
> cpd_at_nodo1:~$ cd ompi
> cpd_at_nodo1:~/ompi$ mpirun -hostfile hf1 -mca mpi_paffinity_alone 1 -report-bindings -c 2 pi
> [nodo1:02772] [[8858,0],0] odls:default:fork binding child [[8858,1],0] to cpus 0001
> [nodo1:02772] [[8858,0],0] odls:default:fork binding child [[8858,1],1] to cpus 0002
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.841935 2.291500E-13
> cpd_at_nodo1:~/ompi$ cat hf1
> n1 slots=4 max-slots=4
> cpd_at_nodo1:~/ompi$ grep cores /proc/cpuinfo
> cpu cores : 4
> cpu cores : 4
> cpu cores : 4
> cpu cores : 4
> cpd_at_nodo1:~/ompi$ mpirun pi
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 1 1.661314 6.332712E-13

Hmf, with 2 cores I get 1.84 s, 10 out of 10 times. The results below are
10/10, perfectly reproducible without a glitch. -bind-to-core is required to
get -report-bindings to report anything. Is it different from
-mca mpi_paffinity_alone 1? I find it easier to use (shorter to type).
> -bind-to-core|--bind-to-core
> Whether to bind processes to specific cores (the
> default)

> cpd_at_nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 2 pi
> [nodo1:02836] [[9050,0],0] odls:default:fork binding child [[9050,1],0] to cpus 0001
> [nodo1:02836] [[9050,0],0] odls:default:fork binding child [[9050,1],1] to cpus 0002
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 2 1.841275 2.291500E-13
> cpd_at_nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 3 pi
> [nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],0] to cpus 0001
> [nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],1] to cpus 0002
> [nodo1:02839] [[9049,0],0] odls:default:fork binding child [[9049,1],2] to cpus 0004
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 3 1.571870 3.606004E-13
> cpd_at_nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 4 pi
> [nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],0] to cpus 0001
> [nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],1] to cpus 0002
> [nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],2] to cpus 0004
> [nodo1:02843] [[9045,0],0] odls:default:fork binding child [[9045,1],3] to cpus 0008
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 4 1.436920 4.236611E-13
> cpd_at_nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 5 pi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 5 slots
> that were requested by the application:

Hmm, wait, I got this one on a second try (so it's really 1/20):
> cpd_at_nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -report-bindings -c 4 pi
> [nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],0] to cpus 0001
> [nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],1] to cpus 0002
> [nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],2] to cpus 0004
> [nodo1:02934] [[9016,0],0] odls:default:fork binding child [[9016,1],3] to cpus 0008
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 4 2.437651 4.236611E-13
Hmm, yup, I see it: again a 1 s difference from the expected 1.43 s. No clue.

The full devel map for the -c 4 run (with -nooversubscribe, just to make sure) is:
> cpd_at_nodo1:~/ompi$ mpirun -hostfile hf1 -bind-to-core -nooversubscribe -report-bindings -display-devel-map -c 4 pi
>
> Map generated by mapping policy: 0402
> Npernode: 0 Oversubscribe allowed: FALSE CPU Lists: FALSE
> Num new daemons: 0 New daemon starting vpid INVALID
> Num nodes: 1
>
> Data for node: Name: nodo1 Launch id: -1 Arch: ffc91200 State: 2
> Num boards: 1 Num sockets/board: 1 Num cores/socket: 4
> Daemon: [[9168,0],0] Daemon launched: True
> Num slots: 4 Slots in use: 4
> Num slots allocated: 4 Max slots: 4
> Username on node: NULL
> Num procs: 4 Next node_rank: 4
> Data for proc: [[9168,1],0]
> Pid: 0 Local rank: 0 Node rank: 0
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[9168,1],1]
> Pid: 0 Local rank: 1 Node rank: 1
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[9168,1],2]
> Pid: 0 Local rank: 2 Node rank: 2
> State: 0 App_context: 0 Slot list: NULL
> Data for proc: [[9168,1],3]
> Pid: 0 Local rank: 3 Node rank: 3
> State: 0 App_context: 0 Slot list: NULL
> [nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],0] to cpus 0001
> [nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],1] to cpus 0002
> [nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],2] to cpus 0004
> [nodo1:02974] [[9168,0],0] odls:default:fork binding child [[9168,1],3] to cpus 0008
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 4 1.452407 4.236611E-13

And this is using the rankfile
> cpd_at_nodo1:~/ompi$ cat rf
> rank 0=n1 slot=0:0
> rank 1=n1 slot=0:1
> rank 2=n1 slot=0:2
> rank 3=n1 slot=0:3
> cpd_at_nodo1:~/ompi$ mpirun -rf rf -report-bindings -c 4 pi
> [nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],0] to slot_list 0:0
> [nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],1] to slot_list 0:1
> [nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],2] to slot_list 0:2
> [nodo1:03004] [[9202,0],0] odls:default:fork binding child [[9202,1],3] to slot_list 0:3
> # Estimation of pi is 3.141593 after 1e+08 iterations
> # PCs Time (s) Error
> 4 1.435335 4.236611E-13

So I thought I'd better ask for help and, as mailing-list experience
suggests, have yet another coffee in the meantime :-)

Are these results normal? Why can't I get results on the Core i7 similar to
those on the Pentium Dual? I would have expected to see about 0.84 s with
-c 2 every now and then, and very frequently when using -bind-to-core. I
realize I don't have a clear picture of when/whether a hostfile with slots is
required, or whether -num-cores with -bind-to-core should have worked on its
own. Is there any mpirun switch or MCA parameter I'm missing that would give
10/10 consistent results on the Pentium Dual, and/or let the pi application
scale well on the Core i7?

Thanks in advance for any suggestions.