Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Running OpenMPI on SGI Altix with 4096 cores: very poor performance
From: Gilbert Grosdidier (Gilbert.Grosdidier_at_[hidden])
Date: 2010-12-22 00:09:10


There is indeed a high rate of communication. But the buffer
size is always the same for a given pair of processes, and I thought
that mpi_leave_pinned would avoid freeing (and re-registering) the memory in this case.
Am I wrong?
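As a sanity check (parameter names assumed from the OMPI 1.4 series; worth verifying against the actual install), the values the run will really use can be dumped with:

```shell
# Dump the leave_pinned and rdma-mpool registration-cache settings in effect
ompi_info --param mpi all | grep leave_pinned
ompi_info --param mpool all | grep rcache
```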

  Thanks, Best, G.

On 21/12/2010 18:52, Matthieu Brucher wrote:
> Don't forget that MPT has some optimizations OpenMPI may not have, such as
> "overriding" free(). This way, MPT can get a huge performance boost
> if you're allocating and freeing memory often, and the same happens if you
> communicate often.
>
> Matthieu
>
> 2010/12/21 Gilbert Grosdidier<Gilbert.Grosdidier_at_[hidden]>:
>> Hi George,
>> Thanks for your help. The bottom line is that the processes are neatly
>> placed on the nodes/cores, as far as I can tell from the map:
>> [...]
>> Process OMPI jobid: [33285,1] Process rank: 4
>> Process OMPI jobid: [33285,1] Process rank: 5
>> Process OMPI jobid: [33285,1] Process rank: 6
>> Process OMPI jobid: [33285,1] Process rank: 7
>> Data for node: Name: r34i0n1 Num procs: 8
>> Process OMPI jobid: [33285,1] Process rank: 8
>> Process OMPI jobid: [33285,1] Process rank: 9
>> Process OMPI jobid: [33285,1] Process rank: 10
>> Process OMPI jobid: [33285,1] Process rank: 11
>> Process OMPI jobid: [33285,1] Process rank: 12
>> Process OMPI jobid: [33285,1] Process rank: 13
>> Process OMPI jobid: [33285,1] Process rank: 14
>> Process OMPI jobid: [33285,1] Process rank: 15
>> Data for node: Name: r34i0n2 Num procs: 8
>> Process OMPI jobid: [33285,1] Process rank: 16
>> Process OMPI jobid: [33285,1] Process rank: 17
>> Process OMPI jobid: [33285,1] Process rank: 18
>> Process OMPI jobid: [33285,1] Process rank: 19
>> Process OMPI jobid: [33285,1] Process rank: 20
>> [...]
>> But the performance is still very low ;-(
>> Best, G.
>> On Dec 20, 2010, at 22:27, George Bosilca wrote:
>>
>> That's a first step. My question was more related to the process overlay on
>> the cores. If the MPI implementation places one process per node, then rank k
>> and rank k+1 will always be on separate nodes, and the communications will
>> have to go over IB. Conversely, if the MPI implementation places the
>> processes per core, then rank k and k+1 will [mostly] be on the same node
>> and the communications will go over shared memory. Depending on how the
>> processes are placed and how you create the neighborhoods, the performance
>> can be drastically impacted.
>>
>> There is a pretty good description of the problem at:
>> http://www.hpccommunity.org/f55/behind-scenes-mpi-process-placement-640/
>>
>> Some hints at
>> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling. I suggest
>> you play with the --byslot and --bynode options to see how they affect the
>> performance of your application.
>>
>> For the hardcore cases we provide a rankfile feature. More info at:
>> http://www.open-mpi.org/faq/?category=tuning#using-paffinity
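As an illustration of the scheduling options above (the node names are taken from the map earlier in the thread, the application path and rank counts are hypothetical, and slot numbering must be checked against the actual node topology):

```shell
# Fill each node's slots before moving to the next node
# (ranks k and k+1 then mostly share a node and use shared memory)
mpirun --byslot -np 4096 ./my_app

# Round-robin ranks across nodes
# (ranks k and k+1 then land on different nodes and talk over IB)
mpirun --bynode -np 4096 ./my_app

# Hardcore placement via a rankfile
cat > myrankfile <<'EOF'
rank 0=r34i0n0 slot=0
rank 1=r34i0n0 slot=1
rank 2=r34i0n1 slot=0
rank 3=r34i0n1 slot=1
EOF
mpirun -np 4 -rf myrankfile ./my_app
```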
>>
>> Enjoy,
>> george.
>>
>>
>>
>> On Dec 20, 2010, at 15:45, Gilbert Grosdidier wrote:
>>
>> Yes, there is definitely only 1 process per core with both MPI
>> implementations.
>>
>> Thanks, G.
>>
>>
>> On 20/12/2010 20:39, George Bosilca wrote:
>>
>> Are your processes placed the same way with the two MPI implementations?
>> Per-node vs. per-core?
>>
>> george.
>>
>> On Dec 20, 2010, at 11:14, Gilbert Grosdidier wrote:
>>
>> Hello,
>>
>> I am now at a loss with my runs of OpenMPI (namely 1.4.3) on an SGI Altix
>> cluster with 2048 or 4096 cores, running over Infiniband.
>>
>> After fixing several rather obvious failures with Ralph's, Jeff's and
>> John's help, I am now facing the bottom of this story, since:
>> - there are no more obvious failure messages
>> - compared to running the application with SGI MPT, the CPU performance
>> I get is very low, and it decreases as the number of cores increases (cf. below)
>> - these performances are highly reproducible
>> - I tried a very high number of -mca parameters, to no avail
>>
>> If I take the MPT CPU performance as a reference, it is about 900 (in some
>> arbitrary unit), whatever the number of cores I use (up to 8192).
>> But, when running with OMPI, I get:
>> - 700 with 1024 cores (which is already rather low)
>> - 300 with 2048 cores
>> - 60 with 4096 cores
>>
>> The computing loop over which the above CPU performance is evaluated
>> includes a stack of MPI exchanges [per core: 8 x (MPI_Isend + MPI_Irecv)
>> + MPI_Waitall].
>>
>> The application is of the 'domain partition' type, and the performance,
>> together with the memory footprint, is nearly identical on all cores. The
>> memory footprint is twice as high in the OMPI case (1.5 GB/core) as in
>> the MPT case (0.7 GB/core).
>>
>> What could be wrong with all this, please?
>>
>> I provided the 'ompi_info -all' output in attachment; the config.log is
>> in attachment as well. I compiled OMPI with icc. I checked that numa and
>> affinity are OK.
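(For reference, the NUMA layout and the CPU mask actually applied to a rank can be inspected on a compute node with standard Linux tools; the PID below is a placeholder for a running MPI rank:)

```shell
numactl --hardware    # show NUMA nodes, their CPUs and memory
taskset -cp <pid>     # show the CPU affinity mask of a running process
```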
>>
>> I use the following command to run my OMPI app:
>>
>> mpiexec -mca btl_openib_rdma_pipeline_send_length 65536 \
>>     -mca btl_openib_rdma_pipeline_frag_size 65536 \
>>     -mca btl_openib_min_rdma_pipeline_size 65536 \
>>     -mca btl_self_rdma_pipeline_send_length 262144 \
>>     -mca btl_self_rdma_pipeline_frag_size 262144 \
>>     -mca plm_rsh_num_concurrent 4096 -mca mpi_paffinity_alone 1 \
>>     -mca mpi_leave_pinned 1 -mca btl_sm_max_send_size 128 \
>>     -mca coll_tuned_pre_allocate_memory_comm_size_limit 128 \
>>     -mca btl_openib_cq_size 128 -mca btl_ofud_rd_num 128 \
>>     -mca mpool_rdma_rcache_size_limit 131072 -mca mpi_preconnect_mpi 0 \
>>     -mca mpool_sm_min_size 131072 -mca mpi_abort_print_stack 1 \
>>     -mca btl sm,openib,self -mca btl_openib_want_fork_support 0 \
>>     -mca opal_set_max_sys_limits 1 -mca osc_pt2pt_no_locks 1 \
>>     -mca osc_rdma_no_locks 1 \
>>     $PBS_JOBDIR/phmc_tm_p2.$PBS_JOBID -v -f $Jinput
>>
>> OpenIB info:
>>
>> 1) OFED-1.4.1, installed by SGI
>>
>> 2) Linux xxxxxx 2.6.16.60-0.42.10-smp #1 SMP Tue Apr 27 05:11:27 UTC 2010
>> x86_64 x86_64 x86_64 GNU/Linux
>> OS: SGI ProPack 6SP5 for Linux, Build 605r1.sles10-0909302200
>>
>> 3) Most probably running an SGI subnet manager
>>
>> 4) > ibv_devinfo (on a worker node)
>> hca_id: mlx4_0
>>     fw_ver:          2.7.000
>>     node_guid:       0030:48ff:ffcc:4c44
>>     sys_image_guid:  0030:48ff:ffcc:4c47
>>     vendor_id:       0x02c9
>>     vendor_part_id:  26418
>>     hw_ver:          0xA0
>>     board_id:        SM_2071000001000
>>     phys_port_cnt:   2
>>     port: 1
>>         state:       PORT_ACTIVE (4)
>>         max_mtu:     2048 (4)
>>         active_mtu:  2048 (4)
>>         sm_lid:      1
>>         port_lid:    6009
>>         port_lmc:    0x00
>>     port: 2
>>         state:       PORT_ACTIVE (4)
>>         max_mtu:     2048 (4)
>>         active_mtu:  2048 (4)
>>         sm_lid:      1
>>         port_lid:    6010
>>         port_lmc:    0x00
>>
>> 5) > ifconfig -a (on a worker node)
>> eth0  Link encap:Ethernet  HWaddr 00:30:48:CE:73:30
>>       inet addr:192.168.159.10  Bcast:192.168.159.255  Mask:255.255.255.0
>>       inet6 addr: fe80::230:48ff:fece:7330/64 Scope:Link
>>       UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
>>       RX packets:32337499 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:34733462 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:1000
>>       RX bytes:11486224753 (10954.1 Mb)  TX bytes:16450996864 (15688.8 Mb)
>>       Memory:fbc60000-fbc80000
>>
>> eth1  Link encap:Ethernet  HWaddr 00:30:48:CE:73:31
>>       BROADCAST MULTICAST  MTU:1500  Metric:1
>>       RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:1000
>>       RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>       Memory:fbce0000-fbd00000
>>
>> ib0   Link encap:UNSPEC  HWaddr 80-00-00-48-FE-C0-00-00-00-00-00-00-00-00-00-00
>>       inet addr:10.148.9.198  Bcast:10.148.255.255  Mask:255.255.0.0
>>       inet6 addr: fe80::230:48ff:ffcc:4c45/64 Scope:Link
>>       UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>       RX packets:115055101 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:5390843 errors:0 dropped:182 overruns:0 carrier:0
>>       collisions:0 txqueuelen:256
>>       RX bytes:49592870352 (47295.4 Mb)  TX bytes:43566897620 (41548.6 Mb)
>>
>> ib1   Link encap:UNSPEC  HWaddr 80-00-00-49-FE-C0-00-00-00-00-00-00-00-00-00-00
>>       inet addr:10.149.9.198  Bcast:10.149.255.255  Mask:255.255.0.0
>>       inet6 addr: fe80::230:48ff:ffcc:4c46/64 Scope:Link
>>       UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>       RX packets:673448 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:187 errors:0 dropped:5 overruns:0 carrier:0
>>       collisions:0 txqueuelen:256
>>       RX bytes:37713088 (35.9 Mb)  TX bytes:11228 (10.9 Kb)
>>
>> lo    Link encap:Local Loopback
>>       inet addr:127.0.0.1  Mask:255.0.0.0
>>       inet6 addr: ::1/128 Scope:Host
>>       UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>       RX packets:33504149 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:33504149 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:0
>>       RX bytes:5100850397 (4864.5 Mb)  TX bytes:5100850397 (4864.5 Mb)
>>
>> sit0  Link encap:IPv6-in-IPv4
>>       NOARP  MTU:1480  Metric:1
>>       RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:0
>>       RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>
>> 6) > limit (on a worker node)
>> cputime         unlimited
>> filesize        unlimited
>> datasize        unlimited
>> stacksize       300000 kbytes
>> coredumpsize    0 kbytes
>> memoryuse       unlimited
>> vmemoryuse      unlimited
>> descriptors     16384
>> memorylocked    unlimited
>> maxproc         303104
>>
>> If some info is still missing despite all my efforts, please ask.
>>
>> Thanks in advance for any hints, Best, G.