Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] deadlock when calling MPI_gatherv
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-04-27 09:33:53


Can you provide a small chunk of code that replicates the problem, perchance?
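
Something along the lines of the sketch below would be ideal (this is just a hypothetical skeleton with invented buffer sizes, not your code): each rank sends a variable-length char buffer to root 0 via MPI_Gatherv.

/* Hypothetical minimal reproducer sketch -- buffer sizes are made up;
 * the real run appears to have used ~1.3 MB payloads. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a different number of chars. */
    int sendcount = 100000 * (rank + 1);
    char *sendbuf = calloc(sendcount, 1);

    int *recvcounts = NULL, *displs = NULL;
    char *recvbuf = NULL;
    if (rank == 0) {
        /* Root sets up per-rank counts and displacements. */
        recvcounts = malloc(size * sizeof(int));
        displs     = malloc(size * sizeof(int));
        int total = 0;
        for (i = 0; i < size; i++) {
            recvcounts[i] = 100000 * (i + 1);
            displs[i]     = total;
            total        += recvcounts[i];
        }
        recvbuf = malloc(total);
    }

    MPI_Gatherv(sendbuf, sendcount, MPI_CHAR,
                recvbuf, recvcounts, displs, MPI_CHAR,
                0, MPI_COMM_WORLD);

    if (rank == 0) printf("MPI_Gatherv completed\n");
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with something like "mpirun -np 2 ./gatherv_test" (binary name is just a placeholder), that should either finish immediately or show the hang you're seeing.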

On Apr 27, 2010, at 9:22 AM, Terry Dontje wrote:

> How does the stack for the non-sm BTL run look? I assume it is probably the same. Also, can you dump the message queues for rank 1? What's interesting is that you have a bunch of pending receives; do you expect that to be the case when the MPI_Gatherv occurred?
>
> --td
>
> Teng Lin wrote:
>> Hi,
>>
>> We recently ran into a deadlock when calling MPI_Gatherv with Open MPI 1.3.4. At first it seemed to have something to do with the sm BTL; however, it still hangs even after turning off the sm BTL.
>>
>> Any idea how to track down the problem?
>>
>> Thanks,
>> Teng
>>
>> #################################################
>> Stack trace for master node
>> #################################################
>> mca_btl_sm_component_progress
>> opal_progress
>> opal_condition_wait
>> ompi_request_default_wait_all
>> ompi_coll_tuned_sendrecv_actual
>> ompi_coll_tuned_barrier_intra_two_procs
>> ompi_coll_tuned_barrier_intra_dec_fixed
>> mca_coll_sync_gatherv
>> PMPI_Gatherv
>>
>>
>> #################################################
>> Stack trace for slave node
>> #################################################
>> mca_btl_sm_component_progress
>> opal_progress
>> opal_condition_wait
>> ompi_request_wait_completion
>> mca_pml_ob1_recv
>> mca_coll_basic_gatherv_intra
>> mca_coll_sync_gatherv
>>
>>
>> #################################################
>> Message queue from TotalView
>> #################################################
>> MPI_COMM_WORLD
>> Comm_size 2
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI_COMM_SELF
>> Comm_size 1
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI_COMM_NULL
>> Comm_size 0
>> Comm_rank -2
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI COMMUNICATOR 3 DUP FROM 0
>> Comm_size 2
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI COMMUNICATOR 4 SPLIT FROM 3
>> Comm_size 2
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI COMMUNICATOR 5 SPLIT FROM 4
>> Comm_size 2
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI COMMUNICATOR 6 SPLIT FROM 4
>> Comm_size 1
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI COMMUNICATOR 7 DUP FROM 4
>> Comm_size 2
>> Comm_rank 0
>> Pending receives
>> [0]
>> Receive: 0x80b9000
>> Data: 1 * MPI_CHAR
>> Status Pending
>> Source 0 (orterun<xxxx>.0)
>> Tag 7 (0x00000007)
>> User Buffer 0xb06fa010 -> 0x00000000 (0)
>> Buffer Length 1359312 (0x0014bdd0)
>> [1]
>> Receive: 0x80b9200
>> Data: 1 * MPI_CHAR
>> Status Pending
>> Source 0 (orterun<xxxx>.0)
>> Tag 5 (0x00000005)
>> User Buffer 0xb0c2a010 -> 0x00000000 (0)
>> Buffer Length 1359312 (0x0014bdd0)
>> [2]
>> Receive: 0x80b9400
>> Data: 1 * MPI_CHAR
>> Status Pending
>> Source 1 (orterun<xxxx>.1)
>> Tag 3 (0x00000003)
>> User Buffer 0xb115a010 -> 0xc0ef9e79 (-1058038151)
>> Buffer Length 1359312 (0x0014bdd0)
>> [3]
>> Receive: 0x80b9600
>> Data: 1 * MPI_CHAR
>> Status Pending
>> Source 1 (orterun<xxxx>.1)
>> Tag 1 (0x00000001)
>> User Buffer 0xb168a010 -> 0xc0c662aa (-1060740438)
>> Buffer Length 1359312 (0x0014bdd0)
>> [4]
>> Receive: 0x82a2500
>> Data: 1 * MPI_CHAR
>> Status Pending
>> Source 0 (orterun<xxxx>.0)
>> Tag 11 (0x0000000b)
>> User Buffer 0xafc9a010 -> 0x00000000 (0)
>> Buffer Length 1359312 (0x0014bdd0)
>> [5]
>> Receive: 0x82a2700
>> Data: 1 * MPI_CHAR
>> Status Pending
>> Source 0 (orterun<xxxx>.0)
>> Tag 9 (0x00000009)
>> User Buffer 0xb01ca010 -> 0x00000000 (0)
>> Buffer Length 1359312 (0x0014bdd0)
>>
>> Unexpected messages : no information available
>> Pending sends
>> [0]
>> Send: 0x80b8500
>> Data transfer completed
>> Status Complete
>> Target 0 (orterun<xxxx>.0)
>> Tag 4 (0x00000004)
>> Buffer 0xb0846010 -> 0x40544279 (1079263865)
>> Buffer Length 2548 (0x000009f4)
>> [1]
>> Send: 0x80b8780
>> Data transfer completed
>> Status Complete
>> Target 0 (orterun<xxxx>.0)
>> Tag 6 (0x00000006)
>> Buffer 0xb0d76010 -> 0x41a756bf (1101485759)
>> Buffer Length 2992 (0x00000bb0)
>> [2]
>> Send: 0x80b8a00
>> Data transfer completed
>> Status Complete
>> Target 1 (orterun<xxxx>.1)
>> Tag 0 (0x00000000)
>> Buffer 0xb12a6010 -> 0xbf94cfc4 (-1080766524)
>> Buffer Length 3856 (0x00000f10)
>> [3]
>> Send: 0x80b8c80
>> Data transfer completed
>> Status Complete
>> Target 1 (orterun<xxxx>.1)
>> Tag 2 (0x00000002)
>> Buffer 0xb17d6010 -> 0x400a1a6c (1074403948)
>> Buffer Length 3952 (0x00000f70)
>> [4]
>> Send: 0x831f080
>> Data transfer completed
>> Status Complete
>> Target 0 (orterun<xxxx>.0)
>> Tag 8 (0x00000008)
>> Buffer 0xafde6010 -> 0xc0de2c50 (-1059181488)
>> Buffer Length 3292 (0x00000cdc)
>> [5]
>> Send: 0x831f300
>> Data transfer completed
>> Status Complete
>> Target 0 (orterun<xxxx>.0)
>> Tag 10 (0x0000000a)
>> Buffer 0xb0316010 -> 0x41169ca7 (1092000935)
>> Buffer Length 3232 (0x00000ca0)
>>
>> MPI COMMUNICATOR 8 SPLIT FROM 5
>> Comm_size 2
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>> MPI COMMUNICATOR 9 SPLIT FROM 5
>> Comm_size 2
>> Comm_rank 0
>> Pending receives : none
>> Unexpected messages : no information available
>> Pending sends : none
>>
>>
>>
>>
>>
>
>
> --
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.650.633.7054
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje_at_[hidden]
>
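
(Side note on the "turning off the sm BTL" step mentioned above: the usual way to exclude it is on the mpirun command line, e.g.

  mpirun --mca btl ^sm -np 2 ./gatherv_test

or, equivalently, to list only the BTLs you do want, e.g. "--mca btl self,tcp". The ./gatherv_test name is just a placeholder for the actual binary.)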

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/