
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Bad parallel scaling using Code Saturne with openmpi
From: Dugenoux Albert (dugenouxa_at_[hidden])
Date: 2012-07-17 09:19:41


Hello. As promised, here are the results of the different simulations and parameter sets, according to the MPI options:

TEST  DESCRIPTION                                                  SHARING              MPI   WITH PBS   ELAPSED TIME, 1ST ITERATION
  1   Node 2                                                       12 process           yes   no         0.21875E+03
  2   Node 1                                                       12 process           yes   no         0.21957E+03
  3   Node 1, with 24 process to test multithreading               24 process           yes   no         0.20613E+03
  4   Node 2                                                       12 process           yes   yes        0.22130E+03
  5   Node 2, with 24 process to test multithreading               24 process           yes   no         0.27300E+03
  6   (no result)
  7   Nodes 1, 2                                                   2 x 6 process        yes   yes        0.17304E+03
  8   Nodes 1, 2                                                   2 x 11 process       yes   yes        0.12395E+03
  9   Nodes 1, 2                                                   2 x 12 process       yes   yes        0.11812E+03
 10   Nodes 3, 4                                                   2 x 12 process       yes   yes        0.11237E+03
 11   Nodes 1, 2, 3 with 1 more process upon node 3                2 x 12 + 1 process   yes   yes        0.56223E+03
 12   Nodes 1, 2, 3; MPI options --bycore --bind-to-core           2 x 12 + 1 process   yes   yes        0.32452E+03
 13   Nodes 1, 4, 3 with 1 more process upon node 3                2 x 12 + 1 process   yes   yes        0.37252E+03
 14   Nodes 1, 4, 3; MPI options --bysocket --bind-to-socket       2 x 12 + 1 process   yes   yes        0.56666E+03
 15   Nodes 1, 4, 3; MPI options --bycore --bind-to-core           2 x 12 + 1 process   yes   yes        0.39983E+03
 16   Nodes 2, 3, 4                                                3 x 12 process       yes   yes        0.85723E+03
 17   Nodes 2, 3, 4                                                3 x 8 process        yes   yes        0.49378E+03
 18   Nodes 1, 2, 3                                                3 x 8 process        yes   yes        0.51863E+03
 19   Nodes 1, 2, 3, 4                                             4 x 6 process        yes   yes        0.73272E+03
 20   (no result)
 21   Nodes 1, 2, 3, 4; MPI options --bysocket --bind-to-socket    4 x 6 process        yes   yes        0.67739E+03
 22   Nodes 1, 2, 3, 4; MPI options --bycore --bind-to-core        4 x 6 process        yes   yes        0.69612E+03

The most surprising results, even taking the latency between nodes into account, are tests 11 to 15. By adding only one process on node 3, the elapsed time becomes 0.56e+03 s, i.e. about 5 times that of cases 9 and 10. When partitioning over 25 processes, each partition represents 4% of the simulation (I have checked the partitions: they contain approximately the same number of elements, plus or minus 8%). Even if one assigns a latency factor of 10 to that extra 4%, i.e. 40% more time, one should obtain (relative to test 10): 0.11e+03 x 1.40 ~= 0.154e+03 s.
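For reference, this is the kind of mpirun command line the binding options above correspond to. It is only a sketch: the hostfile layout, node names, process count and the cs_solver executable name are placeholders, not copied from my job scripts.

    # hosts.txt (assumed layout): one line per node, 12 slots each
    #   node1 slots=12
    #   node2 slots=12
    #   node3 slots=12

    # Map and bind ranks socket by socket (as in tests 14 and 21)
    mpirun -np 25 --hostfile hosts.txt --bysocket --bind-to-socket ./cs_solver

    # Map and bind ranks core by core (as in tests 12, 15 and 22)
    mpirun -np 25 --hostfile hosts.txt --bycore --bind-to-core ./cs_solver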
In addition, when I observe the data transfers on the eth0 connection during an iteration, I see that when nodes 1 and 2 transfer, for example, 5 MB, node 3 transfers 2.5 MB. But since node 3 handles only 4% of the simulation data, it should only need about 200 KB! The results also differ markedly between socket binding and core binding, as tests 13, 14 and 15 show.

Regards.
Albert
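P.S. For anyone who wants to check the same traffic figures, here is a minimal sketch for sampling the eth0 byte counters on a node over one iteration window; the 10-second interval and the interface name are assumptions, and any monitoring tool reading the same kernel counters would do equally well.

    #!/bin/sh
    # Read the kernel's per-interface byte counters before and after a window
    RX0=$(cat /sys/class/net/eth0/statistics/rx_bytes)
    TX0=$(cat /sys/class/net/eth0/statistics/tx_bytes)
    sleep 10   # roughly one iteration
    RX1=$(cat /sys/class/net/eth0/statistics/rx_bytes)
    TX1=$(cat /sys/class/net/eth0/statistics/tx_bytes)
    echo "rx: $(( (RX1 - RX0) / 1024 )) KB   tx: $(( (TX1 - TX0) / 1024 )) KB"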