Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] deadlock when calling MPI_gatherv
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-04-27 09:22:52


How does the stack look for the non-SM BTL run? I assume it is probably
the same. Also, can you dump the message queues for rank 1? What's
interesting is that you have a bunch of pending receives; did you expect
that to be the case when the MPI_Gatherv occurred?

--td

Teng Lin wrote:
> Hi,
>
> We recently ran into a deadlock when calling MPI_Gatherv with Open MPI 1.3.4. It seemed to have something to do with the sm BTL at first; however, it still hangs even after turning off the sm BTL.
>
> Any idea how to track down the problem?
>
> Thanks,
> Teng
>
> #################################################
> Stack trace for master node
> #################################################
> mca_btl_sm_component_progress
> opal_progress
> opal_condition_wait
> ompi_request_default_wait_all
> ompi_coll_tuned_sendrecv_actual
> ompi_coll_tuned_barrier_intra_two_procs
> ompi_coll_tuned_barrier_intra_dec_fixed
> mca_coll_sync_gatherv
> PMPI_Gatherv
>
>
> #################################################
> Stack trace for slave node
> #################################################
> mca_btl_sm_component_progress
> opal_progress
> opal_condition_wait
> ompi_request_wait_completion
> mca_pml_ob1_recv
> mca_coll_basic_gatherv_intra
> mca_coll_sync_gatherv
>
>
> #################################################
> Message queue from TotalView
> #################################################
> MPI_COMM_WORLD
> Comm_size 2
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI_COMM_SELF
> Comm_size 1
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI_COMM_NULL
> Comm_size 0
> Comm_rank -2
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI COMMUNICATOR 3 DUP FROM 0
> Comm_size 2
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI COMMUNICATOR 4 SPLIT FROM 3
> Comm_size 2
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI COMMUNICATOR 5 SPLIT FROM 4
> Comm_size 2
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI COMMUNICATOR 6 SPLIT FROM 4
> Comm_size 1
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI COMMUNICATOR 7 DUP FROM 4
> Comm_size 2
> Comm_rank 0
> Pending receives
> [0]
> Receive: 0x80b9000
> Data: 1 * MPI_CHAR
> Status Pending
> Source 0 (orterun<xxxx>.0)
> Tag 7 (0x00000007)
> User Buffer 0xb06fa010 -> 0x00000000 (0)
> Buffer Length 1359312 (0x0014bdd0)
> [1]
> Receive: 0x80b9200
> Data: 1 * MPI_CHAR
> Status Pending
> Source 0 (orterun<xxxx>.0)
> Tag 5 (0x00000005)
> User Buffer 0xb0c2a010 -> 0x00000000 (0)
> Buffer Length 1359312 (0x0014bdd0)
> [2]
> Receive: 0x80b9400
> Data: 1 * MPI_CHAR
> Status Pending
> Source 1 (orterun<xxxx>.1)
> Tag 3 (0x00000003)
> User Buffer 0xb115a010 -> 0xc0ef9e79 (-1058038151)
> Buffer Length 1359312 (0x0014bdd0)
> [3]
> Receive: 0x80b9600
> Data: 1 * MPI_CHAR
> Status Pending
> Source 1 (orterun<xxxx>.1)
> Tag 1 (0x00000001)
> User Buffer 0xb168a010 -> 0xc0c662aa (-1060740438)
> Buffer Length 1359312 (0x0014bdd0)
> [4]
> Receive: 0x82a2500
> Data: 1 * MPI_CHAR
> Status Pending
> Source 0 (orterun<xxxx>.0)
> Tag 11 (0x0000000b)
> User Buffer 0xafc9a010 -> 0x00000000 (0)
> Buffer Length 1359312 (0x0014bdd0)
> [5]
> Receive: 0x82a2700
> Data: 1 * MPI_CHAR
> Status Pending
> Source 0 (orterun<xxxx>.0)
> Tag 9 (0x00000009)
> User Buffer 0xb01ca010 -> 0x00000000 (0)
> Buffer Length 1359312 (0x0014bdd0)
>
> Unexpected messages : no information available
> Pending sends
> [0]
> Send: 0x80b8500
> Data transfer completed
> Status Complete
> Target 0 (orterun<xxxx>.0)
> Tag 4 (0x00000004)
> Buffer 0xb0846010 -> 0x40544279 (1079263865)
> Buffer Length 2548 (0x000009f4)
> [1]
> Send: 0x80b8780
> Data transfer completed
> Status Complete
> Target 0 (orterun<xxxx>.0)
> Tag 6 (0x00000006)
> Buffer 0xb0d76010 -> 0x41a756bf (1101485759)
> Buffer Length 2992 (0x00000bb0)
> [2]
> Send: 0x80b8a00
> Data transfer completed
> Status Complete
> Target 1 (orterun<xxxx>.1)
> Tag 0 (0x00000000)
> Buffer 0xb12a6010 -> 0xbf94cfc4 (-1080766524)
> Buffer Length 3856 (0x00000f10)
> [3]
> Send: 0x80b8c80
> Data transfer completed
> Status Complete
> Target 1 (orterun<xxxx>.1)
> Tag 2 (0x00000002)
> Buffer 0xb17d6010 -> 0x400a1a6c (1074403948)
> Buffer Length 3952 (0x00000f70)
> [4]
> Send: 0x831f080
> Data transfer completed
> Status Complete
> Target 0 (orterun<xxxx>.0)
> Tag 8 (0x00000008)
> Buffer 0xafde6010 -> 0xc0de2c50 (-1059181488)
> Buffer Length 3292 (0x00000cdc)
> [5]
> Send: 0x831f300
> Data transfer completed
> Status Complete
> Target 0 (orterun<xxxx>.0)
> Tag 10 (0x0000000a)
> Buffer 0xb0316010 -> 0x41169ca7 (1092000935)
> Buffer Length 3232 (0x00000ca0)
>
> MPI COMMUNICATOR 8 SPLIT FROM 5
> Comm_size 2
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
> MPI COMMUNICATOR 9 SPLIT FROM 5
> Comm_size 2
> Comm_rank 0
> Pending receives : none
> Unexpected messages : no information available
> Pending sends : none
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]
