Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] OMPI 1.4.3 hangs in gather
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-01-13 14:15:58


+1 on what Pasha said -- if using rdmacm fixes the problem, then there's something else nefarious going on...

You might want to check padb with your hangs to see where all the processes are hung to see if anything obvious jumps out. I'd be surprised if there's a bug in the oob cpc; it's been around for a long, long time; it should be pretty stable.

Do we create QP's differently between oob and rdmacm, such that perhaps they are "better" (maybe better routed, or using a different SL, or ...) when created via rdmacm?

On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:

> RDMACM or OOB can not effect on performance of this benchmark, since they are not involved in communication. So I'm not sure that the performance changes that you see are related to connection manager changes.
> About oob - I'm not aware about hangs issue there, the code is very-very old, we did not touch it for a long time.
>
> Regards,
>
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> Email: shamisp_at_[hidden]
>
>
>
>
>
> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
>
>> Hi,
>>
>> For the first problem, I can see that when using rdmacm as openib oob
>> I get much better performence results (and no hangs!).
>>
>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl
>> sm,self,openib -mca btl_openib_connect_rdmacm_priority 100
>> imb/src/IMB-MPI1 gather -npmin 64
>>
>>
>> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
>>
>> 0 1000 0.04 0.05 0.05
>>
>> 1 1000 19.64 19.69 19.67
>>
>> 2 1000 19.97 20.02 19.99
>>
>> 4 1000 21.86 21.96 21.89
>>
>> 8 1000 22.87 22.94 22.90
>>
>> 16 1000 24.71 24.80 24.76
>>
>> 32 1000 27.23 27.32 27.27
>>
>> 64 1000 30.96 31.06 31.01
>>
>> 128 1000 36.96 37.08 37.02
>>
>> 256 1000 42.64 42.79 42.72
>>
>> 512 1000 60.32 60.59 60.46
>>
>> 1024 1000 82.44 82.74 82.59
>>
>> 2048 1000 497.66 499.62 498.70
>>
>> 4096 1000 684.15 686.47 685.33
>>
>> 8192 519 544.07 546.68 545.85
>>
>> 16384 519 653.20 656.23 655.27
>>
>> 32768 519 704.48 707.55 706.60
>>
>> 65536 519 918.00 922.12 920.86
>>
>> 131072 320 2414.08 2422.17 2418.20
>>
>> 262144 160 4198.25 4227.58 4213.19
>>
>> 524288 80 7333.04 7503.99 7438.18
>>
>> 1048576 40 13692.60 14150.20 13948.75
>>
>> 2097152 20 30377.34 32679.15 31779.86
>>
>> 4194304 10 61416.70 71012.50 68380.04
>>
>> How can the oob cause the hang? Isn't it only used to bring up the connection?
>> Does the oob has any part of the connections were made?
>>
>> Thanks,
>> Dororn
>>
>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham <doron.ompi_at_[hidden]> wrote:
>>>
>>> Hi
>>>
>>> All machines on the setup are IDataPlex with Nehalem 12 cores per node, 24GB memory.
>>>
>>>
>>>
>>> · Problem 1 – OMPI 1.4.3 hangs in gather:
>>>
>>>
>>>
>>> I’m trying to run IMB and gather operation with OMPI 1.4.3 (Vanilla).
>>>
>>> It happens when np >= 64 and message size exceed 4k:
>>>
>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib imb/src-1.4.2/IMB-MPI1 gather –npmin 64
>>>
>>>
>>>
>>> voltairenodes consists of 64 machines.
>>>
>>>
>>>
>>> #----------------------------------------------------------------
>>>
>>> # Benchmarking Gather
>>>
>>> # #processes = 64
>>>
>>> #----------------------------------------------------------------
>>>
>>> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
>>>
>>> 0 1000 0.02 0.02 0.02
>>>
>>> 1 331 14.02 14.16 14.09
>>>
>>> 2 331 12.87 13.08 12.93
>>>
>>> 4 331 14.29 14.43 14.34
>>>
>>> 8 331 16.03 16.20 16.11
>>>
>>> 16 331 17.54 17.74 17.64
>>>
>>> 32 331 20.49 20.62 20.53
>>>
>>> 64 331 23.57 23.84 23.70
>>>
>>> 128 331 28.02 28.35 28.18
>>>
>>> 256 331 34.78 34.88 34.80
>>>
>>> 512 331 46.34 46.91 46.60
>>>
>>> 1024 331 63.96 64.71 64.33
>>>
>>> 2048 331 460.67 465.74 463.18
>>>
>>> 4096 331 637.33 643.99 640.75
>>>
>>>
>>>
>>> This the padb output:
>>>
>>> padb –A –x –Ormgr=mpirun –tree:
>>>
>>>
>>>
>>> =~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2011.01.06 14:33:17 =~=~=~=~=~=~=~=~=~=~=~=
>>>
>>>
>>>
>>> Warning, remote process state differs across ranks
>>>
>>> state : ranks
>>>
>>> R (running) : [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
>>>
>>> S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
>>>
>>> Stack trace(s) for thread: 1
>>>
>>> -----------------
>>>
>>> [0-63] (64 processes)
>>>
>>> -----------------
>>>
>>> main() at ?:?
>>>
>>> IMB_init_buffers_iter() at ?:?
>>>
>>> IMB_gather() at ?:?
>>>
>>> PMPI_Gather() at pgather.c:175
>>>
>>> mca_coll_sync_gather() at coll_sync_gather.c:46
>>>
>>> ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
>>>
>>> -----------------
>>>
>>> [0,3-63] (62 processes)
>>>
>>> -----------------
>>>
>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
>>>
>>> mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>
>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>
>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>
>>> -----------------
>>>
>>> [1] (1 processes)
>>>
>>> -----------------
>>>
>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
>>>
>>> mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>
>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>
>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>
>>> -----------------
>>>
>>> [2] (1 processes)
>>>
>>> -----------------
>>>
>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
>>>
>>> ompi_request_default_wait() at request/req_wait.c:37
>>>
>>> ompi_request_wait_completion() at ../ompi/request/request.h:375
>>>
>>> opal_condition_wait() at ../opal/threads/condition.h:99
>>>
>>> Stack trace(s) for thread: 2
>>>
>>> -----------------
>>>
>>> [0-63] (64 processes)
>>>
>>> -----------------
>>>
>>> start_thread() at ?:?
>>>
>>> btl_openib_async_thread() at btl_openib_async.c:344
>>>
>>> poll() at ?:?
>>>
>>> Stack trace(s) for thread: 3
>>>
>>> -----------------
>>>
>>> [0-63] (64 processes)
>>>
>>> -----------------
>>>
>>> start_thread() at ?:?
>>>
>>> service_thread_start() at btl_openib_fd.c:427
>>>
>>> select() at ?:?
>>>
>>> -bash-3.2$
>>>
>>>
>>>
>>>
>>>
>>> When running again padb after couple of minutes, I can see that the total number of processes remain in the same position but
>>>
>>> different processes are at different positions.
>>>
>>> For example, this is the diff between two padb outputs:
>>>
>>>
>>>
>>> Warning, remote process state differs across ranks
>>>
>>> state : ranks
>>>
>>> -R (running) : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
>>>
>>> -S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
>>>
>>> +R (running) : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
>>>
>>> +S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
>>>
>>> Stack trace(s) for thread: 1
>>>
>>> -----------------
>>>
>>> [0-63] (64 processes)
>>>
>>> @@ -13,21 +13,21 @@
>>>
>>> mca_coll_sync_gather() at coll_sync_gather.c:46
>>>
>>> ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
>>>
>>> -----------------
>>>
>>> - [0,3-63] (62 processes)
>>>
>>> + [0-5,8-63] (62 processes)
>>>
>>> -----------------
>>>
>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
>>>
>>> mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>
>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>
>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>
>>> -----------------
>>>
>>> - [1] (1 processes)
>>>
>>> + [6] (1 processes)
>>>
>>> -----------------
>>>
>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
>>>
>>> mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>
>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>
>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>
>>> -----------------
>>>
>>> - [2] (1 processes)
>>>
>>> + [7] (1 processes)
>>>
>>> -----------------
>>>
>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
>>>
>>> ompi_request_default_wait() at request/req_wait.c:37
>>>
>>>
>>>
>>>
>>>
>>> Choosing different gather algorithm seems to bypass the hang.
>>>
>>> I’ve used the following mca parameters:
>>>
>>> --mca coll_tuned_use_dynamic_rules 1
>>>
>>> --mca coll_tuned_gather_algorithm 1
>>>
>>>
>>>
>>> Actually, both dec_fixed and basic_linear works while binomial and linear_sync doesn’t.
>>>
>>>
>>>
>>> With OMPI 1.5 it doesn’t hangs (with all gather algorithms) and it much faster (the number of repetitions is much higher):
>>>
>>> #----------------------------------------------------------------
>>>
>>> # Benchmarking Gather
>>>
>>> # #processes = 64
>>>
>>> #----------------------------------------------------------------
>>>
>>> #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
>>>
>>> 0 1000 0.02 0.03 0.02
>>>
>>> 1 1000 18.50 18.55 18.53
>>>
>>> 2 1000 18.17 18.25 18.22
>>>
>>> 4 1000 19.04 19.10 19.07
>>>
>>> 8 1000 19.60 19.67 19.64
>>>
>>> 16 1000 21.39 21.47 21.43
>>>
>>> 32 1000 24.83 24.91 24.87
>>>
>>> 64 1000 27.35 27.45 27.40
>>>
>>> 128 1000 33.23 33.34 33.29
>>>
>>> 256 1000 41.24 41.39 41.32
>>>
>>> 512 1000 52.62 52.81 52.71
>>>
>>> 1024 1000 73.20 73.46 73.32
>>>
>>> 2048 1000 416.36 418.04 417.22
>>>
>>> 4096 1000 638.54 640.70 639.65
>>>
>>> 8192 1000 506.26 506.97 506.63
>>>
>>> 16384 1000 600.63 601.40 601.02
>>>
>>> 32768 1000 639.52 640.34 639.93
>>>
>>> 65536 640 914.22 916.02 915.13
>>>
>>> 131072 320 2287.37 2295.18 2291.35
>>>
>>> 262144 160 4041.36 4070.58 4056.27
>>>
>>> 524288 80 7292.35 7463.27 7397.14
>>>
>>> 1048576 40 13647.15 14107.15 13905.29
>>>
>>> 2097152 20 30625.00 32635.45 31815.36
>>>
>>> 4194304 10 63543.01 70987.49 68680.48
>>>
>>>
>>>
>>>
>>>
>>> · Problem 2 – segmentation fault with OMPI 1.4.3/1.5 and IMB gather np=768:
>>>
>>> When trying to run the same command but with np=768 I get segmentation fault:
>>>
>>> openmpi-1.4.3/bin/mpirun -np 768 -machinefile voltairenodes -mca btl sm,self,openib -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_gather_algorithm 1 imb/src/IMB-MPI1 gather -npmin 768 -mem 1.6
>>>
>>>
>>>
>>> This happens in OMPI 1.4.3 and 1.5
>>>
>>>
>>>
>>> [compa163:20249] *** Process received signal ***
>>>
>>> [compa163:20249] Signal: Segmentation fault (11)
>>>
>>> [compa163:20249] Signal code: Address not mapped (1)
>>>
>>> [compa163:20249] Failing at address: 0x2aab4a204000
>>>
>>> [compa163:20249] [ 0] /lib64/libpthread.so.0 [0x366aa0e7c0]
>>>
>>> [compa163:20249] [ 1] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(ompi_convertor_unpack+0x15f) [0x2b077882282e]
>>>
>>> [compa163:20249] [ 2] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9e1672]
>>>
>>> [compa163:20249] [ 3] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9dd0b6]
>>>
>>> [compa163:20249] [ 4] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so [0x2b077c459d87]
>>>
>>> [compa163:20249] [ 5] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0xbe) [0x2b0778d845b8]
>>>
>>> [compa163:20249] [ 6] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6d62]
>>>
>>> [compa163:20249] [ 7] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6ba7]
>>>
>>> [compa163:20249] [ 8] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6a90]
>>>
>>> [compa163:20249] [ 9] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d298dc5]
>>>
>>> [compa163:20249] [10] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d2990d3]
>>>
>>> [compa163:20249] [11] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d286e9b]
>>>
>>> [compa163:20249] [12] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_sync.so [0x2b077d07e96c]
>>>
>>> [compa163:20249] [13] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(PMPI_Gather+0x55e) [0x2b077883ec9a]
>>>
>>> [compa163:20249] [14] imb/src/IMB-MPI1(IMB_gather+0xe8) [0x40a088]
>>>
>>> [compa163:20249] [15] imb/src/IMB-MPI1(IMB_init_buffers_iter+0x28a) [0x405baa]
>>>
>>> [compa163:20249] [16] imb/src/IMB-MPI1(main+0x30f) [0x40362f]
>>>
>>> [compa163:20249] [17] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3669e1d994]
>>>
>>> [compa163:20249] [18] imb/src/IMB-MPI1 [0x403269]
>>>
>>> [compa163:20249] *** End of error message ***
>>>
>>>
>>> Any ideas? More debuggin tips?
>>>
>>> Thanks,
>>> Doron
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/