Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Processes stuck after MPI_Waitall() in 1.4.1
From: Terry Dontje (terry.dontje_at_[hidden])
Date: 2010-07-28 06:56:31


Here are a few other suggestions:

1. Have you tried running your code with the TCP btl, just to rule out
a general algorithm issue with the collective?

2. While using the openib btl, you may want to try running with RDMA
turned off by passing the following parameters to mpirun:
-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca
btl_openib_flags 1

3. While using the openib btl, you may want to try raising the
rendezvous limit to see whether eliminating rendezvous messages helps
(others on the list: is there an easier way to do this?). Set the
following parameters, raising the 8192 and 2048 values:
-mca btl_openib_receive_queues "P,8192" -mca btl_openib_max_send_size
8192 -mca btl_openib_eager_limit 8192 -mca btl_openib_rndv_eager_limit 2048

4. You may also want to check whether the basic collective algorithms
work in place of the tuned ones (tuned is the default, I believe) by
setting "-mca coll_basic_priority 100". The idea here is to determine
whether the tuned collective itself is tickling the issue.
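
The four suggestions above could be tried one at a time with command lines along the following lines. This is only a rough sketch: the application name ./my_app, the process count, and the raised limit values (65536 and 16384) are placeholders and not values from this thread. The "btl tcp,sm,self" selection in the first command is an assumption about how to take openib out of the picture. The script only prints the commands, so they can be reviewed before anything is launched.

```shell
#!/bin/sh
# Dry run: builds and prints the four diagnostic command lines; nothing is launched.
# APP, NP and the raised limits are placeholders, not values from the thread.
APP="./my_app"
NP=16
EAGER=65536        # raised from 8192 (illustrative value)
RNDV_EAGER=16384   # raised from 2048 (illustrative value)

# 1. Take openib out of the picture: TCP between nodes, sm within a node.
TCP_CMD="mpirun -np $NP -mca btl tcp,sm,self $APP"

# 2. openib with eager RDMA disabled (parameters as given in item 2).
NORDMA_CMD="mpirun -np $NP -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 -mca btl_openib_flags 1 $APP"

# 3. openib with larger eager/rendezvous limits (parameters as given in item 3).
BIGEAGER_CMD="mpirun -np $NP -mca btl_openib_receive_queues P,$EAGER -mca btl_openib_max_send_size $EAGER -mca btl_openib_eager_limit $EAGER -mca btl_openib_rndv_eager_limit $RNDV_EAGER $APP"

# 4. Give the basic collective component priority over tuned (item 4).
BASIC_CMD="mpirun -np $NP -mca coll_basic_priority 100 $APP"

for cmd in "$TCP_CMD" "$NORDMA_CMD" "$BIGEAGER_CMD" "$BASIC_CMD"; do
    printf '%s\n' "$cmd"
done
```

Changing one thing per run should make it easier to tell which setting, if any, makes the hang go away.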

--td

Terry Dontje wrote:
> With this earlier failure, do you know how many messages may have been
> transferred between the two processes? Is there a way to narrow this
> down to a small piece of code? Do you have TotalView or DDT at your
> disposal?
>
> --td
>
> Brian Smith wrote:
>> Also, the application I'm having trouble with appears to work fine with
>> MVAPICH2 1.4.1, if that is any help.
>>
>> -Brian
>>
>> On Tue, 2010-07-27 at 10:48 -0400, Terry Dontje wrote:
>>
>>> Can you try a simple point-to-point program?
>>>
>>> --td
>>>
>>> Brian Smith wrote:
>>>
>>>> After running on two processors across two nodes, the problem occurs
>>>> much earlier during execution:
>>>>
>>>> (gdb) bt
>>>> #0 opal_sys_timer_get_cycles ()
>>>> at ../opal/include/opal/sys/amd64/timer.h:46
>>>> #1 opal_timer_base_get_cycles ()
>>>> at ../opal/mca/timer/linux/timer_linux.h:31
>>>> #2 opal_progress () at runtime/opal_progress.c:181
>>>> #3 0x00002b4bc3c00215 in opal_condition_wait (count=2,
>>>> requests=0x7fff33372480, statuses=0x7fff33372450)
>>>> at ../opal/threads/condition.h:99
>>>> #4 ompi_request_default_wait_all (count=2, requests=0x7fff33372480,
>>>> statuses=0x7fff33372450) at request/req_wait.c:262
>>>> #5 0x00002b4bc3c915b7 in ompi_coll_tuned_sendrecv_actual
>>>> (sendbuf=0x2aaad11dfaf0, scount=117692,
>>>> sdatatype=0x2b4bc3fa9ea0, dest=1, stag=-13, recvbuf=<value optimized
>>>> out>, rcount=117692,
>>>> rdatatype=0x2b4bc3fa9ea0, source=1, rtag=-13, comm=0x12cd98c0,
>>>> status=0x0) at coll_tuned_util.c:55
>>>> #6 0x00002b4bc3c982db in ompi_coll_tuned_sendrecv (sbuf=0x2aaad10f9d10,
>>>> scount=117692, sdtype=0x2b4bc3fa9ea0,
>>>> rbuf=0x2aaae104d010, rcount=117692, rdtype=0x2b4bc3fa9ea0,
>>>> comm=0x12cd98c0, module=0x12cda340)
>>>> at coll_tuned_util.h:60
>>>> #7 ompi_coll_tuned_alltoall_intra_two_procs (sbuf=0x2aaad10f9d10,
>>>> scount=117692, sdtype=0x2b4bc3fa9ea0,
>>>> rbuf=0x2aaae104d010, rcount=117692, rdtype=0x2b4bc3fa9ea0,
>>>> comm=0x12cd98c0, module=0x12cda340)
>>>> at coll_tuned_alltoall.c:432
>>>> #8 0x00002b4bc3c1b71f in PMPI_Alltoall (sendbuf=0x2aaad10f9d10,
>>>> sendcount=117692, sendtype=0x2b4bc3fa9ea0,
>>>> recvbuf=0x2aaae104d010, recvcount=117692, recvtype=0x2b4bc3fa9ea0,
>>>> comm=0x12cd98c0) at palltoall.c:84
>>>> #9 0x00002b4bc399cc86 in mpi_alltoall_f (sendbuf=0x2aaad10f9d10 "Z\n
>>>> \271\356\023\254\271?", sendcount=0x7fff33372688,
>>>> sendtype=<value optimized out>, recvbuf=0x2aaae104d010 "",
>>>> recvcount=0x7fff3337268c, recvtype=0xb67490,
>>>> comm=0x12d9d20, ierr=0x7fff33372690) at palltoall_f.c:76
>>>> #10 0x00000000004613b8 in m_alltoall_z_ ()
>>>> #11 0x00000000004ec55f in redis_pw_ ()
>>>> #12 0x00000000005643d0 in choleski_mp_orthch_ ()
>>>> #13 0x000000000043fbba in MAIN__ ()
>>>> #14 0x000000000042f15c in main ()
>>>>
>>>> On Tue, 2010-07-27 at 06:14 -0400, Terry Dontje wrote:
>>>>
>>>>
>>>>> A clarification on your previous email: you had your code working
>>>>> with OMPI 1.4.1 but an older version of OFED? Then you upgraded to
>>>>> OFED 1.4 and things stopped working? It sounds like your current system
>>>>> is set up with OMPI 1.4.2 and OFED 1.5. Anyway, I am a little
>>>>> confused as to when things might have actually broken.
>>>>>
>>>>> My first guess would be that something may be wrong with the OFED setup.
>>>>> Have you checked the status of your IB devices with ibv_devinfo? Have you
>>>>> run any of the OFED RC tests like ibv_rc_pingpong?
>>>>>
>>>>> If the above seems OK, have you tried to run a simpler OMPI test like
>>>>> connectivity? I would see whether a simple np=2 run spanning two
>>>>> nodes fails.
>>>>>
>>>>> What OS distribution and version are you running on?
>>>>>
>>>>> --td
>>>>> Brian Smith wrote:
>>>>>
>>>>>
>>>>>> In case my previous e-mail is too vague for anyone to address, here's a
>>>>>> backtrace from my application. This version, compiled with Intel
>>>>>> 11.1.064 (OpenMPI 1.4.2 w/ gcc 4.4.2), hangs during MPI_Alltoall
>>>>>> instead. Running on 16 CPUs, Opteron 2427, Mellanox Technologies
>>>>>> MT25418 w/ OFED 1.5
>>>>>>
>>>>>> strace on all ranks repeatedly shows:
>>>>>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
>>>>>> events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}, {fd=22,
>>>>>> events=POLLIN}, {fd=23, events=POLLIN}], 7, 0) = 0 (Timeout)
>>>>>> ...
>>>>>>
>>>>>> gdb --pid=<pid>
>>>>>> (gdb) bt
>>>>>> #0 sm_fifo_read () at btl_sm.h:267
>>>>>> #1 mca_btl_sm_component_progress () at btl_sm_component.c:391
>>>>>> #2 0x00002b00085116ea in opal_progress () at
>>>>>> runtime/opal_progress.c:207
>>>>>> #3 0x00002b0007def215 in opal_condition_wait (count=2,
>>>>>> requests=0x7fffd27802a0, statuses=0x7fffd2780270)
>>>>>> at ../opal/threads/condition.h:99
>>>>>> #4 ompi_request_default_wait_all (count=2, requests=0x7fffd27802a0,
>>>>>> statuses=0x7fffd2780270) at request/req_wait.c:262
>>>>>> #5 0x00002b0007e805b7 in ompi_coll_tuned_sendrecv_actual
>>>>>> (sendbuf=0x2aaac2c4c210, scount=28000,
>>>>>> sdatatype=0x2b0008198ea0, dest=6, stag=-13, recvbuf=<value optimized
>>>>>> out>, rcount=28000, rdatatype=0x2b0008198ea0,
>>>>>> source=10, rtag=-13, comm=0x16ad7420, status=0x0) at
>>>>>> coll_tuned_util.c:55
>>>>>> #6 0x00002b0007e8705f in ompi_coll_tuned_sendrecv (sbuf=0x2aaac2b04010,
>>>>>> scount=28000, sdtype=0x2b0008198ea0,
>>>>>> rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b0008198ea0,
>>>>>> comm=0x16ad7420, module=0x16ad8450)
>>>>>> at coll_tuned_util.h:60
>>>>>> #7 ompi_coll_tuned_alltoall_intra_pairwise (sbuf=0x2aaac2b04010,
>>>>>> scount=28000, sdtype=0x2b0008198ea0,
>>>>>> rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b0008198ea0,
>>>>>> comm=0x16ad7420, module=0x16ad8450)
>>>>>> at coll_tuned_alltoall.c:70
>>>>>> #8 0x00002b0007e0a71f in PMPI_Alltoall (sendbuf=0x2aaac2b04010,
>>>>>> sendcount=28000, sendtype=0x2b0008198ea0,
>>>>>> recvbuf=0x2aaac99a2010, recvcount=28000, recvtype=0x2b0008198ea0,
>>>>>> comm=0x16ad7420) at palltoall.c:84
>>>>>> #9 0x00002b0007b8bc86 in mpi_alltoall_f (sendbuf=0x2aaac2b04010 "",
>>>>>> sendcount=0x7fffd27806a0,
>>>>>> sendtype=<value optimized out>,
>>>>>> recvbuf=0x2aaac99a2010 "6%\177e\373\354\306>\346\226z\262\347\350
>>>>>> \260>\032ya(\303\003\272\276\231\343\322\363zjþ\230\247i\232\307PԾ(\304
>>>>>> \373\321D\261ľ\204֜Εh־H\266H\342l2\245\276\231C7]\003\250Ǿ`\277\231\272
>>>>>> \265E\261>j\213ѓ\370\002\263>НØx.\254>}\332-\313\371\326\320>\346\245f
>>>>>> \304\f\214\262\276\070\222zf#'\321>\024\066̆\026\227ɾ.T\277\266}\366
>>>>>> \270>h|\323L\330\fƾ^z\214!q*\277\276pQ?O\346\067\270>~\006\300",
>>>>>> recvcount=0x7fffd27806a4, recvtype=0xb67490,
>>>>>> comm=0x12d9ba0, ierr=0x7fffd27806a8) at palltoall_f.c:76
>>>>>> #10 0x00000000004634cc in m_sumf_d_ ()
>>>>>> #11 0x0000000000463072 in m_sum_z_ ()
>>>>>> #12 0x00000000004c8a8b in mrg_grid_rc_ ()
>>>>>> #13 0x00000000004ffc5e in rhosym_ ()
>>>>>> #14 0x0000000000610dc6 in us_mp_set_charge_ ()
>>>>>> #15 0x0000000000771c43 in elmin_ ()
>>>>>> #16 0x0000000000453853 in MAIN__ ()
>>>>>> #17 0x000000000042f15c in main ()
>>>>>>
>>>>>> On other processes:
>>>>>>
>>>>>> (gdb) bt
>>>>>> #0 0x0000003692a0b725 in pthread_spin_lock ()
>>>>>> from /lib64/libpthread.so.0
>>>>>> #1 0x00002aaaaacdfa7b in ibv_cmd_create_qp ()
>>>>>> from /usr/lib64/libmlx4-rdmav2.so
>>>>>> #2 0x00002b9dc1db3ff8 in progress_one_device ()
>>>>>> at /usr/include/infiniband/verbs.h:884
>>>>>> #3 btl_openib_component_progress () at btl_openib_component.c:3451
>>>>>> #4 0x00002b9dc24736ea in opal_progress () at
>>>>>> runtime/opal_progress.c:207
>>>>>> #5 0x00002b9dc1d51215 in opal_condition_wait (count=2,
>>>>>> requests=0x7fffece3cc20, statuses=0x7fffece3cbf0)
>>>>>> at ../opal/threads/condition.h:99
>>>>>> #6 ompi_request_default_wait_all (count=2, requests=0x7fffece3cc20,
>>>>>> statuses=0x7fffece3cbf0) at request/req_wait.c:262
>>>>>> #7 0x00002b9dc1de25b7 in ompi_coll_tuned_sendrecv_actual
>>>>>> (sendbuf=0x2aaac2c4c210, scount=28000,
>>>>>> sdatatype=0x2b9dc20faea0, dest=6, stag=-13, recvbuf=<value optimized
>>>>>> out>, rcount=28000, rdatatype=0x2b9dc20faea0,
>>>>>> source=10, rtag=-13, comm=0x1745b420, status=0x0) at
>>>>>> coll_tuned_util.c:55
>>>>>> #8 0x00002b9dc1de905f in ompi_coll_tuned_sendrecv (sbuf=0x2aaac2b04010,
>>>>>> scount=28000, sdtype=0x2b9dc20faea0,
>>>>>> rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b9dc20faea0,
>>>>>> comm=0x1745b420, module=0x1745c450)
>>>>>> at coll_tuned_util.h:60
>>>>>> #9 ompi_coll_tuned_alltoall_intra_pairwise (sbuf=0x2aaac2b04010,
>>>>>> scount=28000, sdtype=0x2b9dc20faea0,
>>>>>> rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b9dc20faea0,
>>>>>> comm=0x1745b420, module=0x1745c450)
>>>>>> at coll_tuned_alltoall.c:70
>>>>>> #10 0x00002b9dc1d6c71f in PMPI_Alltoall (sendbuf=0x2aaac2b04010,
>>>>>> sendcount=28000, sendtype=0x2b9dc20faea0,
>>>>>> recvbuf=0x2aaac99a2010, recvcount=28000, recvtype=0x2b9dc20faea0,
>>>>>> comm=0x1745b420) at palltoall.c:84
>>>>>> #11 0x00002b9dc1aedc86 in mpi_alltoall_f (sendbuf=0x2aaac2b04010 "",
>>>>>> sendcount=0x7fffece3d020,
>>>>>> sendtype=<value optimized out>,
>>>>>> recvbuf=0x2aaac99a2010 "6%\177e\373\354\306>\346\226z\262\347\350
>>>>>> \260>\032ya(\303\003\272\276\231\343\322\363zjþ\230\247i\232\307PԾ(\304
>>>>>> \373\321D\261ľ\204֜Εh־H\266H\342l2\245\276\231C7]\003\250Ǿ`\277\231\272
>>>>>> \265E\261>j\213ѓ\370\002\263>НØx.\254>}\332-\313\371\326\320>\346\245f
>>>>>> \304\f\214\262\276\070\222zf#'\321>\024\066̆\026\227ɾ.T\277\266}\366
>>>>>> \270>h|\323L\330\fƾ^z\214!q*\277\276pQ?O\346\067\270>~\006\300",
>>>>>> recvcount=0x7fffece3d024, recvtype=0xb67490,
>>>>>> comm=0x12d9ba0, ierr=0x7fffece3d028) at palltoall_f.c:76
>>>>>> #12 0x00000000004634cc in m_sumf_d_ ()
>>>>>> #13 0x0000000000463072 in m_sum_z_ ()
>>>>>> #14 0x00000000004c8a8b in mrg_grid_rc_ ()
>>>>>> #15 0x00000000004ffc5e in rhosym_ ()
>>>>>> #16 0x0000000000610dc6 in us_mp_set_charge_ ()
>>>>>> #17 0x0000000000771c43 in elmin_ ()
>>>>>> #18 0x0000000000453853 in MAIN__ ()
>>>>>> #19 0x000000000042f15c in main ()
>>>>>>
>>>>>>
>>>>>> I set up padb to collect a full report on the process and I've attached
>>>>>> it to this message. Let me know if I can provide anything further.
>>>>>>
>>>>>> Thanks,
>>>>>> -Brian
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 2010-07-21 at 10:07 -0400, Brian Smith wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Hi, All,
>>>>>>>
>>>>>>> A couple of applications that I'm using -- VASP and Charmm -- end up
>>>>>>> "stuck" (for lack of a better word) during a waitall call after some
>>>>>>> non-blocking send/recv action. This only happens when utilizing the
>>>>>>> openib btl. I've followed a couple of bugs where this seemed to happen
>>>>>>> in some previous revisions and tried the work-arounds provided, but to
>>>>>>> no avail. I'm going to try running against a previous version to see if
>>>>>>> it's a regression of some sort, but this problem didn't seem to exist in
>>>>>>> 1.4.1 until our systems were updated to OFED >= 1.4. Any suggestions
>>>>>>> besides the obvious, "well, down-grade from >= 1.4"? What additional
>>>>>>> info can I provide to help?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Brian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> ____________________________________________________________________
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>> --
>>>>> Oracle
>>>>> Terry D. Dontje | Principal Software Engineer
>>>>> Developer Tools Engineering | +1.650.633.7054
>>>>> Oracle - Performance Technologies
>>>>> 95 Network Drive, Burlington, MA 01803
>>>>> Email terry.dontje_at_[hidden]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>>
>
>

-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


