Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance
From: Gilbert Grosdidier (Gilbert.Grosdidier_at_[hidden])
Date: 2010-12-22 00:09:10


There is indeed a high rate of communication. But the buffer
size is always the same for a given pair of processes, and I thought
that mpi_leave_pinned should keep that memory pinned in this case.
Am I wrong?
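
For reference, below is a minimal sketch (hypothetical code, not the actual
application) of the fixed-buffer exchange I have in mind, reduced to two
neighbours instead of the eight per core described further down in the
thread [8 x (MPI_Isend + MPI_Irecv) + MPI_Waitall]. The buffers keep the
same address and size across iterations, which is exactly the case where
the registration cache behind mpi_leave_pinned should pin each buffer once:

    #include <mpi.h>
    #include <stdlib.h>

    #define HALO_COUNT 8192          /* fixed message size per neighbour pair */
    #define NITER      1000

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        /* Allocated once and reused every iteration: with mpi_leave_pinned=1
           the openib BTL should register each buffer a single time and then
           keep finding it in its registration cache. */
        double *sbuf[2], *rbuf[2];
        for (int i = 0; i < 2; ++i) {
            sbuf[i] = calloc(HALO_COUNT, sizeof(double));
            rbuf[i] = calloc(HALO_COUNT, sizeof(double));
        }

        MPI_Request req[4];
        for (int iter = 0; iter < NITER; ++iter) {
            MPI_Irecv(rbuf[0], HALO_COUNT, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(rbuf[1], HALO_COUNT, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Isend(sbuf[0], HALO_COUNT, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[2]);
            MPI_Isend(sbuf[1], HALO_COUNT, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
            MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
            /* ... local computation on the exchanged halos would go here ... */
        }

        /* Freed only at the very end, never inside the loop. */
        for (int i = 0; i < 2; ++i) { free(sbuf[i]); free(rbuf[i]); }
        MPI_Finalize();
        return 0;
    }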

  Thanks, Best, G.

On 21/12/2010 18:52, Matthieu Brucher wrote:
> Don't forget that MPT has some optimizations Open MPI may not have, such
> as "overriding" free(). This way, MPT can get a huge performance boost
> if you're allocating and freeing memory frequently, and the same applies
> if you communicate often.
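>
> As a hypothetical illustration (not the application's actual code), this
> is the allocate/send/free pattern where an intercepted free() pays off:
> the buffer handed to MPI can change address from one iteration to the
> next, so a pinned-memory cache cannot simply be reused, and the library
> at least needs to learn when a cached registration has been freed:
>
>     #include <mpi.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     /* Allocating and freeing the message buffer inside the loop defeats
>        registration caching on RDMA networks; an MPI that intercepts
>        free() (as MPT does) can at least invalidate stale cache entries.
>        Reusing one buffer, as in domain-decomposition halo exchanges,
>        avoids the problem entirely. */
>     int main(int argc, char **argv)
>     {
>         int rank;
>         const int nbytes = 65536;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>         for (int i = 0; i < 1000; ++i) {
>             char *buf = malloc(nbytes);
>             memset(buf, 0, nbytes);
>             if (rank == 0)
>                 MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
>             else if (rank == 1)
>                 MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
>                          MPI_STATUS_IGNORE);
>             free(buf);   /* the next malloc may return a different address */
>         }
>
>         MPI_Finalize();
>         return 0;
>     }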
>
> Matthieu
>
> 2010/12/21 Gilbert Grosdidier <Gilbert.Grosdidier_at_[hidden]>:
>> Hi George,
>> Thanks for your help. The bottom line is that the processes are neatly
>> placed on the nodes/cores, as far as I can tell from the map:
>> [...]
>> Process OMPI jobid: [33285,1] Process rank: 4
>> Process OMPI jobid: [33285,1] Process rank: 5
>> Process OMPI jobid: [33285,1] Process rank: 6
>> Process OMPI jobid: [33285,1] Process rank: 7
>> Data for node: Name: r34i0n1 Num procs: 8
>> Process OMPI jobid: [33285,1] Process rank: 8
>> Process OMPI jobid: [33285,1] Process rank: 9
>> Process OMPI jobid: [33285,1] Process rank: 10
>> Process OMPI jobid: [33285,1] Process rank: 11
>> Process OMPI jobid: [33285,1] Process rank: 12
>> Process OMPI jobid: [33285,1] Process rank: 13
>> Process OMPI jobid: [33285,1] Process rank: 14
>> Process OMPI jobid: [33285,1] Process rank: 15
>> Data for node: Name: r34i0n2 Num procs: 8
>> Process OMPI jobid: [33285,1] Process rank: 16
>> Process OMPI jobid: [33285,1] Process rank: 17
>> Process OMPI jobid: [33285,1] Process rank: 18
>> Process OMPI jobid: [33285,1] Process rank: 19
>> Process OMPI jobid: [33285,1] Process rank: 20
>> [...]
>> But the performance is still very low ;-(
>> Best, G.
>> On Dec 20, 2010, at 22:27, George Bosilca wrote:
>>
>> That's a first step. My question was more related to the process overlay on
>> the cores. If the MPI implementation places one process per node, then rank k
>> and rank k+1 will always be on separate nodes, and the communication will
>> have to go over IB. Conversely, if the MPI implementation packs the
>> processes per core, then rank k and rank k+1 will [mostly] be on the same
>> node and the communication will go over shared memory. Depending on how the
>> processes are placed and how you build the neighborhoods, the performance
>> can be drastically affected.
>>
>> There is a pretty good description of the problem at:
>> http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/
>>
>> Some hints at
>> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest
>> you play with the --byslot and --bynode options to see how these affect the
>> performance of your application.
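>>
>> As a quick way to see where ranks actually land, a tiny MPI program along
>> these lines (a sketch, not an Open MPI tool) prints the rank-to-host
>> mapping, so the --byslot and --bynode runs can be compared directly:
>>
>>     #include <mpi.h>
>>     #include <stdio.h>
>>
>>     /* Print which host each rank runs on; launch it once with --byslot
>>        and once with --bynode to see how consecutive ranks are spread. */
>>     int main(int argc, char **argv)
>>     {
>>         int rank, size, len;
>>         char host[MPI_MAX_PROCESSOR_NAME];
>>
>>         MPI_Init(&argc, &argv);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>>         MPI_Get_processor_name(host, &len);
>>
>>         printf("rank %d of %d on %s\n", rank, size, host);
>>
>>         MPI_Finalize();
>>         return 0;
>>     }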
>>
>> For the hardcore cases we provide a rankfile feature. More info at:
>> http://www.open-mpi.org/faq/?category=tuning#using-paffinity
>>
>> Enjoy,
>> george.
>>
>>
>>
>> On Dec 20, 2010, at 15:45, Gilbert Grosdidier wrote:
>>
>> Yes, there is definitely only 1 process per core with both MPI
>> implementations.
>>
>> Thanks, G.
>>
>>
>> On 20/12/2010 20:39, George Bosilca wrote:
>>
>> Are your processes placed the same way with the two MPI implementations?
>> Per-node vs. per-core?
>>
>> george.
>>
>> On Dec 20, 2010, at 11:14, Gilbert Grosdidier wrote:
>>
>> Bonjour,
>>
>> I am now at a loss with my running of Open MPI (namely 1.4.3) on an SGI
>> Altix cluster with 2048 or 4096 cores, over InfiniBand.
>>
>> After fixing several rather obvious failures with Ralph's, Jeff's and John's
>> help, I am now at the bottom of this story, since:
>>
>> - there are no more obvious failure messages
>> - compared to running the application with SGI MPT, the CPU performance
>> I get is very low, and it decreases as the number of cores increases (cf. below)
>> - these performance figures are highly reproducible
>> - I tried a very large number of -mca parameters, to no avail
>>
>> If I take the MPT CPU speed as a reference, it is about 900 (in some
>> arbitrary unit), whatever the number of cores I use (up to 8192).
>> But when running with OMPI, I get:
>>
>> - 700 with 1024 cores (which is already rather low)
>> - 300 with 2048 cores
>> - 60 with 4096 cores
>>
>> The computing loop, over which the above CPU performance is evaluated,
>> includes a stack of MPI exchanges [per core: 8 x (MPI_Isend + MPI_Irecv)
>> + MPI_Waitall].
>>
>> The application is of the 'domain partition' type, and the performance,
>> together with the memory footprint, is nearly identical on all cores.
>> The memory footprint is twice as high in the OMPI case (1.5 GB/core)
>> as in the MPT case (0.7 GB/core).
>>
>> What could be wrong with all this, please?
>>
>> I provided the 'ompi_info -all' output as an attachment. The config.log is
>> attached as well. I compiled OMPI with icc, and I checked that NUMA and
>> affinity are OK.
>>
>> I use the following command to run my OMPI app:
>>
>> "mpiexec -mca btl_openib_rdma_pipeline_send_length 65536\
>>
>> -mca btl_openib_rdma_pipeline_frag_size 65536\
>>
>> -mca btl_openib_min_rdma_pipeline_size 65536\
>>
>> -mca btl_self_rdma_pipeline_send_length 262144\
>>
>> -mca btl_self_rdma_pipeline_frag_size 262144\
>>
>> -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1\
>>
>> -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128\
>>
>> -mca coll_tuned_pre_allocate_memory_comm_size_limit 128\
>>
>> -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128\
>>
>> -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0\
>>
>> -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1\
>>
>> -mca btl sm,openib,self -mca btl_openib_want_fork_support 0\
>>
>> -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1\
>>
>> -mca osc_rdma_no_locks 1\
>>
>> $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput".
>>
>> OpenIB info:
>>
>> 1) OFED-1.4.1, installed by SGI
>> 2) Linux xxxxxx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010
>>    x86_64 x86_64 x86_64 GNU/Linux
>>    OS: SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200
>> 3) Running most probably an SGI subnet manager
>>
>> 4) > ibv_devinfo (on a worker node)
>>
>>    hca_id: mlx4_0
>>        fw_ver:         2.7.000
>>        node_guid:      0030:48ff:ffcc:4c44
>>        sys_image_guid: 0030:48ff:ffcc:4c47
>>        vendor_id:      0x02c9
>>        vendor_part_id: 26418
>>        hw_ver:         0xA0
>>        board_id:       SM_2071000001000
>>        phys_port_cnt:  2
>>        port: 1
>>            state:      PORT_ACTIVE (4)
>>            max_mtu:    2048 (4)
>>            active_mtu: 2048 (4)
>>            sm_lid:     1
>>            port_lid:   6009
>>            port_lmc:   0x00
>>        port: 2
>>            state:      PORT_ACTIVE (4)
>>            max_mtu:    2048 (4)
>>            active_mtu: 2048 (4)
>>            sm_lid:     1
>>            port_lid:   6010
>>            port_lmc:   0x00
>>
>> 5) > ifconfig -a (on a worker node)
>>
>> eth0   Link encap:Ethernet  HWaddr 00:30:48:CE:73:30
>>        inet addr:192.168.159.10  Bcast:192.168.159.255  Mask:255.255.255.0
>>        inet6 addr: fe80::230:48ff:fece:7330/64 Scope:Link
>>        UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
>>        RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
>>        TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
>>        collisions:0 txqueuelen:1000
>>        RX bytes:11486224753 (10954.1 Mb)  TX bytes:16450996864 (15688.8 Mb)
>>        Memory:fbc60000-fbc80000
>>
>> eth1   Link encap:Ethernet  HWaddr 00:30:48:CE:73:31
>>        BROADCAST MULTICAST  MTU:1500  Metric:1
>>        RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>        TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>        collisions:0 txqueuelen:1000
>>        RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>        Memory:fbce0000-fbd00000
>>
>> ib0    Link encap:UNSPEC  HWaddr 80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00
>>        inet addr:10.148.9.198  Bcast:10.148.255.255  Mask:255.255.0.0
>>        inet6 addr: fe80::230:48ff:ffcc:4c45/64 Scope:Link
>>        UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>        RX packets:115055101 errors:0 dropped:0 overruns:0 frame:0
>>        TX packets:5390843 errors:0 dropped:182 overruns:0 carrier:0
>>        collisions:0 txqueuelen:256
>>        RX bytes:49592870352 (47295.4 Mb)  TX bytes:43566897620 (41548.6 Mb)
>>
>> ib1    Link encap:UNSPEC  HWaddr 80-00-00-49-FE-C0-00-00-00-00-00-00-00-00-00-00
>>        inet addr:10.149.9.198  Bcast:10.149.255.255  Mask:255.255.0.0
>>        inet6 addr: fe80::230:48ff:ffcc:4c46/64 Scope:Link
>>        UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>        RX packets:673448 errors:0 dropped:0 overruns:0 frame:0
>>        TX packets:187 errors:0 dropped:5 overruns:0 carrier:0
>>        collisions:0 txqueuelen:256
>>        RX bytes:37713088 (35.9 Mb)  TX bytes:11228 (10.9 Kb)
>>
>> lo     Link encap:Local Loopback
>>        inet addr:127.0.0.1  Mask:255.0.0.0
>>        inet6 addr: ::1/128 Scope:Host
>>        UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>        RX packets:33504149 errors:0 dropped:0 overruns:0 frame:0
>>        TX packets:33504149 errors:0 dropped:0 overruns:0 carrier:0
>>        collisions:0 txqueuelen:0
>>        RX bytes:5100850397 (4864.5 Mb)  TX bytes:5100850397 (4864.5 Mb)
>>
>> sit0   Link encap:IPv6-in-IPv4
>>        NOARP  MTU:1480  Metric:1
>>        RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>        TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>        collisions:0 txqueuelen:0
>>        RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>
>> 6) > limit (on a worker node)
>>
>>    cputime      unlimited
>>    filesize     unlimited
>>    datasize     unlimited
>>    stacksize    300000 kbytes
>>    coredumpsize 0 kbytes
>>    memoryuse    unlimited
>>    vmemoryuse   unlimited
>>    descriptors  16384
>>    memorylocked unlimited
>>    maxproc      303104
>>
>> If some info is still missing despite all my efforts, please ask.
>>
>> Thanks in advance for any hints, Best, G.