Don't forget that MPT has some optimizations that OpenMPI may not have, such as
"overriding" free(). This can give MPT a large performance advantage
if you allocate and free memory often, and likewise if you
communicate often.
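
To make the free() point a bit more concrete, here is a toy Python model of a registration (pinning) cache. It is only a sketch under stated assumptions: the class name, the function names and the cost units are all hypothetical, and it is not meant to describe how MPT or OpenMPI actually implement this. The idea is that a library which is told when a buffer is freed can safely keep buffers registered between sends, instead of paying the registration cost every time.

```python
REGISTER_COST = 100  # arbitrary units, assumed much larger than a small transfer
TRANSFER_COST = 1    # arbitrary units


class PinCache:
    """Toy model of a cache of registered (pinned) buffers. Hypothetical."""

    def __init__(self, enabled):
        self.enabled = enabled
        self.pinned = set()
        self.cost = 0

    def send(self, addr):
        # Pay the registration cost unless the buffer is already cached.
        if not (self.enabled and addr in self.pinned):
            self.cost += REGISTER_COST
            if self.enabled:
                self.pinned.add(addr)
        self.cost += TRANSFER_COST

    def on_free(self, addr):
        # Hook called when the application frees the buffer: the cached
        # registration is dropped so it can never be reused by mistake.
        self.pinned.discard(addr)


def total_cost(enabled, sends=10):
    """Cost of sending the same buffer `sends` times, then freeing it."""
    cache = PinCache(enabled)
    for _ in range(sends):
        cache.send(0x1000)
    cache.on_free(0x1000)
    return cache.cost


print("with cache:   ", total_cost(True))
print("without cache:", total_cost(False))
```

In this toy model, ten sends of the same buffer cost 110 units with the cache and 1010 without it, which is the kind of gap that can show up when an application communicates often.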
Matthieu
2010/12/21 Gilbert Grosdidier <Gilbert.Grosdidier_at_[hidden]>:
> Hi George,
> Thanks for your help. The bottom line is that the processes are neatly
> placed on the nodes/cores,
> as far as I can tell from the map:
> [...]
> Process OMPI jobid: [33285,1] Process rank: 4
> Process OMPI jobid: [33285,1] Process rank: 5
> Process OMPI jobid: [33285,1] Process rank: 6
> Process OMPI jobid: [33285,1] Process rank: 7
> Data for node: Name: r34i0n1 Num procs: 8
> Process OMPI jobid: [33285,1] Process rank: 8
> Process OMPI jobid: [33285,1] Process rank: 9
> Process OMPI jobid: [33285,1] Process rank: 10
> Process OMPI jobid: [33285,1] Process rank: 11
> Process OMPI jobid: [33285,1] Process rank: 12
> Process OMPI jobid: [33285,1] Process rank: 13
> Process OMPI jobid: [33285,1] Process rank: 14
> Process OMPI jobid: [33285,1] Process rank: 15
> Data for node: Name: r34i0n2 Num procs: 8
> Process OMPI jobid: [33285,1] Process rank: 16
> Process OMPI jobid: [33285,1] Process rank: 17
> Process OMPI jobid: [33285,1] Process rank: 18
> Process OMPI jobid: [33285,1] Process rank: 19
> Process OMPI jobid: [33285,1] Process rank: 20
> [...]
> But the performance is still very low ;-(
> Best, G.
> On Dec 20, 2010, at 22:27, George Bosilca wrote:
>
> That's a first step. My question was more about how the processes are laid out on
> the cores. If the MPI implementation places one process per node, then rank k
> and rank k+1 will always be on separate nodes, and the communications will
> have to go over IB. Conversely, if the MPI implementation places the
> processes per core, then rank k and k+1 will [mostly] be on the same node
> and the communications will go over shared memory. Depending on how the
> processes are placed and how you create the neighborhoods, the performance
> can be drastically impacted.
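
The placement argument above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the "byslot" and "bynode" labels just mirror the two policies George describes, and the node count of 4 is an assumption (the 8 cores per node come from the map quoted earlier). It counts how many (rank k, rank k+1) pairs end up on the same node under each policy.

```python
def node_of(rank, num_nodes, cores_per_node, policy):
    """Return the index of the node hosting `rank` under a placement policy."""
    if policy == "byslot":
        # Fill every core of a node before moving on to the next node.
        return rank // cores_per_node
    if policy == "bynode":
        # Deal ranks out round-robin, one per node at a time.
        return rank % num_nodes
    raise ValueError("unknown policy: " + policy)


def same_node_fraction(num_nodes, cores_per_node, policy):
    """Fraction of (k, k+1) rank pairs that share a node."""
    total_ranks = num_nodes * cores_per_node
    pairs = total_ranks - 1
    same = sum(
        node_of(k, num_nodes, cores_per_node, policy)
        == node_of(k + 1, num_nodes, cores_per_node, policy)
        for k in range(pairs)
    )
    return same / pairs


for policy in ("byslot", "bynode"):
    frac = same_node_fraction(4, 8, policy)
    print(f"{policy}: {frac:.0%} of neighbor pairs share a node")
```

With per-core filling, only the pairs that straddle a node boundary (7/8, 15/16, ...) go over the interconnect; with round-robin placement every such pair does. Whether this matters for a given application depends on how its neighborhoods are built from the ranks.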
>
> There is a pretty good description of the problem at:
> http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/
>
> Some hints are at
> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest
> you play with the --byslot and --bynode options to see how this affects the
> performance of your application.
>
> For the hardcore cases we provide a rankfile feature. More info at:
> http://www.open-mpi.org/faq/?category=tuning#using-paffinity
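
As a rough illustration of the rankfile idea, the sketch below generates one line per rank, assuming the `rank N=host slot=M` line format described in the FAQ entry above (please check it against your Open MPI version). The helper name is hypothetical, and the two host names are simply borrowed from the map quoted earlier for illustration; the ranks assigned to them here do not match the real job.

```python
def rankfile_lines(hosts, cores_per_node):
    """Build rankfile lines that fill each host core by core.

    Assumes the 'rank N=host slot=M' syntax; hypothetical helper.
    """
    lines = []
    rank = 0
    for host in hosts:
        for slot in range(cores_per_node):
            lines.append(f"rank {rank}={host} slot={slot}")
            rank += 1
    return lines


# Host names taken from the quoted map, used here for illustration only.
for line in rankfile_lines(["r34i0n1", "r34i0n2"], 8):
    print(line)
```

A different ordering of the loops (or an explicit table) would let you pin any rank to any core, which is what makes the rankfile useful for the hardcore cases.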
>
> Enjoy,
> george.
>
>
>
> On Dec 20, 2010, at 15:45 , Gilbert Grosdidier wrote:
>
> Yes, there is definitely only 1 process per core with both MPI
> implementations.
>
> Thanks, G.
>
>
> On 20/12/2010 at 20:39, George Bosilca wrote:
>
> Are your processes placed the same way with the two MPI implementations?
> Per-node vs. per-core?
>
> george.
>
> On Dec 20, 2010, at 11:14 , Gilbert Grosdidier wrote:
>
> Hello,
>
> I am now at a loss with my runs of OpenMPI (namely 1.4.3)
> on an SGI Altix cluster with 2048 or 4096 cores, running over InfiniBand.
>
> After fixing several rather obvious failures with help from Ralph, Jeff and John,
> I am now facing the core of the problem, since:
> - there are no more obvious failures with messages
> - compared to running the application with SGI-MPT, the CPU performance I get
> is very low, and decreases as the number of cores increases (cf. below)
> - this performance is highly reproducible
> - I tried a very large number of -mca parameters, to no avail
>
> If I take the MPT CPU speed performance as a reference,
> it is about 900 (in some arbitrary unit), whatever the
> number of cores I used (up to 8192).
>
> But, when running with OMPI, I get:
> - 700 with 1024 cores (which is already rather low)
> - 300 with 2048 cores
> - 60 with 4096 cores.
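
To put the figures above on a common scale, this short Python sketch expresses each OMPI measurement as a fraction of the MPT reference of 900. The numbers are the ones quoted in the message; the variable and function names are just illustrative.

```python
MPT_REFERENCE = 900  # arbitrary unit, as quoted in the message

# cores -> OMPI speed, same arbitrary unit
ompi_speed = {1024: 700, 2048: 300, 4096: 60}


def relative_speed(speed, reference=MPT_REFERENCE):
    """OMPI speed expressed as a fraction of the MPT reference."""
    return speed / reference


for cores, speed in sorted(ompi_speed.items()):
    print(f"{cores:5d} cores: {relative_speed(speed):.1%} of the MPT reference")
```

That is roughly 78%, 33% and 7% of the MPT figure, so the gap grows much faster than the core count, which is what points at a scaling issue rather than a constant overhead.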
>
> The computing loop, over which the above CPU performance is evaluated,
> includes a stack of MPI exchanges [per core: 8 x (MPI_Isend + MPI_Irecv) +
> MPI_Waitall].
>
> The application is of the 'domain partition' type,
> and the performance, together with the memory footprint,
> is nearly identical on all cores. The memory footprint is about twice as high in
> the OMPI case (1.5GB/core) as in the MPT case (0.7GB/core).
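
For a sense of what these per-core footprints mean per node, here is a small Python calculation. It assumes 8 processes per node, as shown in the process map quoted earlier in the thread; the function name is hypothetical.

```python
PROCS_PER_NODE = 8  # taken from "Num procs: 8" in the quoted map


def per_node_gb(per_core_gb, procs_per_node=PROCS_PER_NODE):
    """Total footprint on one node, in GB, if every core uses per_core_gb."""
    return per_core_gb * procs_per_node


ompi_gb = per_node_gb(1.5)
mpt_gb = per_node_gb(0.7)
print(f"OMPI: {ompi_gb:.1f} GB per node")
print(f"MPT:  {mpt_gb:.1f} GB per node")
print(f"ratio: {1.5 / 0.7:.2f}")
```

Under that assumption the OMPI run needs about 12 GB per node against about 5.6 GB for MPT, a ratio of roughly 2.1.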
>
> What could be wrong with all this, please?
>
> I provided the 'ompi_info -all' output as an attachment.
> The config.log is attached as well.
>
> I compiled OMPI with icc. I checked that numa and affinity are OK.
>
> I use the following command to run my OMPI app:
>
> "mpiexec -mca btl_openib_rdma_pipeline_send_length 65536 \
> -mca btl_openib_rdma_pipeline_frag_size 65536 \
> -mca btl_openib_min_rdma_pipeline_size 65536 \
> -mca btl_self_rdma_pipeline_send_length 262144 \
> -mca btl_self_rdma_pipeline_frag_size 262144 \
> -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1 \
> -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128 \
> -mca coll_tuned_pre_allocate_memory_comm_size_limit 128 \
> -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128 \
> -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0 \
> -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1 \
> -mca btl sm,openib,self -mca btl_openib_want_fork_support 0 \
> -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1 \
> -mca osc_rdma_no_locks 1 \
> $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput".
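
With this many -mca flags, it can be easier to compare runs if the parameters live in a file. The Python sketch below pulls the `-mca name value` pairs out of a command line and prints them as `name = value` lines, which is the format I believe Open MPI reads from an MCA parameter file such as $HOME/.openmpi/mca-params.conf (please verify for your version). The helper names are hypothetical and the command is a shortened excerpt of the one above.

```python
def mca_pairs(command):
    """Extract (name, value) pairs from '-mca name value' arguments."""
    # Join backslash-continued lines, then split on whitespace.
    tokens = command.replace("\\\n", " ").split()
    pairs = []
    i = 0
    while i < len(tokens):
        if tokens[i] in ("-mca", "--mca") and i + 2 < len(tokens):
            pairs.append((tokens[i + 1], tokens[i + 2]))
            i += 3
        else:
            i += 1
    return pairs


def params_file_lines(pairs):
    """Format the pairs as 'name = value' lines (assumed file format)."""
    return [f"{name} = {value}" for name, value in pairs]


# Shortened excerpt of the command quoted above; "./app" is a placeholder.
command = (
    "mpiexec -mca mpi_paffinity_alone 1 \\\n"
    "-mca mpi_leave_pinned 1 -mca btl sm,openib,self ./app"
)
for line in params_file_lines(mca_pairs(command)):
    print(line)
```

Keeping one such file per experiment also makes it easier to remove parameters one at a time and see which of them, if any, affects the scaling.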
>
> OpenIB info:
>
> 1) OFED-1.4.1, installed by SGI
>
> 2) Linux xxxxxx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010
> x86_64 x86_64 x86_64 GNU/Linux
>
> OS : SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200
>
> 3) Running most probably an SGI subnet manager
>
> 4)> ibv_devinfo (on a worker node)
>
> hca_id: mlx4_0
>     fw_ver: 2.7.000
>     node_guid: 0030:48ff:ffcc:4c44
>     sys_image_guid: 0030:48ff:ffcc:4c47
>     vendor_id: 0x02c9
>     vendor_part_id: 26418
>     hw_ver: 0xA0
>     board_id: SM_2071000001000
>     phys_port_cnt: 2
>     port: 1
>         state: PORT_ACTIVE (4)
>         max_mtu: 2048 (4)
>         active_mtu: 2048 (4)
>         sm_lid: 1
>         port_lid: 6009
>         port_lmc: 0x00
>     port: 2
>         state: PORT_ACTIVE (4)
>         max_mtu: 2048 (4)
>         active_mtu: 2048 (4)
>         sm_lid: 1
>         port_lid: 6010
>         port_lmc: 0x00
>
> 5)> ifconfig -a (on a worker node)
>
> eth0 Link encap:Ethernet HWaddr 00:30:48:CE:73:30
>     inet addr:192.168.159.10 Bcast:192.168.159.255 Mask:255.255.255.0
>     inet6 addr: fe80::230:48ff:fece:7330/64 Scope:Link
>     UP BROADCAST NOTRAILERS RUNNING MULTICAST MTU:1500 Metric:1
>     RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
>     TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
>     collisions:0 txqueuelen:1000
>     RX bytes:11486224753 (10954.1 Mb) TX bytes:16450996864 (15688.8 Mb)
>     Memory:fbc60000-fbc80000
>
> eth1 Link encap:Ethernet HWaddr 00:30:48:CE:73:31
>     BROADCAST MULTICAST MTU:1500 Metric:1
>     RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>     TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>     collisions:0 txqueuelen:1000
>     RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>     Memory:fbce0000-fbd00000
>
> ib0 Link encap:UNSPEC HWaddr 80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00
>     inet addr:10.148.9.198 Bcast:10.148.255.255 Mask:255.255.0.0
>     inet6 addr: fe80::230:48ff:ffcc:4c45/64 Scope:Link
>     UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>     RX packets:115055101 errors:0 dropped:0 overruns:0 frame:0
>     TX packets:5390843 errors:0 dropped:182 overruns:0 carrier:0
>     collisions:0 txqueuelen:256
>     RX bytes:49592870352 (47295.4 Mb) TX bytes:43566897620 (41548.6 Mb)
>
> ib1 Link encap:UNSPEC HWaddr 80-00-00-49-FE-C0-00-00-00-00-00-00-00-00-00-00
>     inet addr:10.149.9.198 Bcast:10.149.255.255 Mask:255.255.0.0
>     inet6 addr: fe80::230:48ff:ffcc:4c46/64 Scope:Link
>     UP BROADCAST RUNNING MULTICAST MTU:65520 Metric:1
>     RX packets:673448 errors:0 dropped:0 overruns:0 frame:0
>     TX packets:187 errors:0 dropped:5 overruns:0 carrier:0
>     collisions:0 txqueuelen:256
>     RX bytes:37713088 (35.9 Mb) TX bytes:11228 (10.9 Kb)
>
> lo Link encap:Local Loopback
>     inet addr:127.0.0.1 Mask:255.0.0.0
>     inet6 addr: ::1/128 Scope:Host
>     UP LOOPBACK RUNNING MTU:16436 Metric:1
>     RX packets:33504149 errors:0 dropped:0 overruns:0 frame:0
>     TX packets:33504149 errors:0 dropped:0 overruns:0 carrier:0
>     collisions:0 txqueuelen:0
>     RX bytes:5100850397 (4864.5 Mb) TX bytes:5100850397 (4864.5 Mb)
>
> sit0 Link encap:IPv6-in-IPv4
>     NOARP MTU:1480 Metric:1
>     RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>     TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>     collisions:0 txqueuelen:0
>     RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
>
> 6)> limit (on a worker node)
>
> cputime unlimited
> filesize unlimited
> datasize unlimited
> stacksize 300000 kbytes
> coredumpsize 0 kbytes
> memoryuse unlimited
> vmemoryuse unlimited
> descriptors 16384
> memorylocked unlimited
> maxproc 303104
>
> If some info is still missing despite all my efforts, please ask.
>
> Thanks in advance for any hints, Best, G.
>
> <config.log.gz><ompi_info-all.001.gz>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher