With this earlier failure do you know how many message may have been transferred between the two processes?  Is there a way to narrow this down to a small piece of code?  Do you have totalview or ddt at your disposal?

--td

Brian Smith wrote:
Also, the application I'm having trouble with appears to work fine with
MVAPICH2 1.4.1, if that is any help.

-Brian

On Tue, 2010-07-27 at 10:48 -0400, Terry Dontje wrote:
  
Can you try a simple point-to-point program?

--td

Brian Smith wrote: 
    
After running on two processors across two nodes, the problem occurs
much earlier during execution:

(gdb) bt
#0  opal_sys_timer_get_cycles ()
at ../opal/include/opal/sys/amd64/timer.h:46
#1  opal_timer_base_get_cycles ()
at ../opal/mca/timer/linux/timer_linux.h:31
#2  opal_progress () at runtime/opal_progress.c:181
#3  0x00002b4bc3c00215 in opal_condition_wait (count=2,
requests=0x7fff33372480, statuses=0x7fff33372450)
    at ../opal/threads/condition.h:99
#4  ompi_request_default_wait_all (count=2, requests=0x7fff33372480,
statuses=0x7fff33372450) at request/req_wait.c:262
#5  0x00002b4bc3c915b7 in ompi_coll_tuned_sendrecv_actual
(sendbuf=0x2aaad11dfaf0, scount=117692, 
    sdatatype=0x2b4bc3fa9ea0, dest=1, stag=-13, recvbuf=<value optimized
out>, rcount=117692, 
    rdatatype=0x2b4bc3fa9ea0, source=1, rtag=-13, comm=0x12cd98c0,
status=0x0) at coll_tuned_util.c:55
#6  0x00002b4bc3c982db in ompi_coll_tuned_sendrecv (sbuf=0x2aaad10f9d10,
scount=117692, sdtype=0x2b4bc3fa9ea0, 
    rbuf=0x2aaae104d010, rcount=117692, rdtype=0x2b4bc3fa9ea0,
comm=0x12cd98c0, module=0x12cda340)
    at coll_tuned_util.h:60
#7  ompi_coll_tuned_alltoall_intra_two_procs (sbuf=0x2aaad10f9d10,
scount=117692, sdtype=0x2b4bc3fa9ea0, 
    rbuf=0x2aaae104d010, rcount=117692, rdtype=0x2b4bc3fa9ea0,
comm=0x12cd98c0, module=0x12cda340)
    at coll_tuned_alltoall.c:432
#8  0x00002b4bc3c1b71f in PMPI_Alltoall (sendbuf=0x2aaad10f9d10,
sendcount=117692, sendtype=0x2b4bc3fa9ea0, 
    recvbuf=0x2aaae104d010, recvcount=117692, recvtype=0x2b4bc3fa9ea0,
comm=0x12cd98c0) at palltoall.c:84
#9  0x00002b4bc399cc86 in mpi_alltoall_f (sendbuf=0x2aaad10f9d10 "Z\n
\271\356\023\254\271?", sendcount=0x7fff33372688, 
    sendtype=<value optimized out>, recvbuf=0x2aaae104d010 "",
recvcount=0x7fff3337268c, recvtype=0xb67490, 
    comm=0x12d9d20, ierr=0x7fff33372690) at palltoall_f.c:76
#10 0x00000000004613b8 in m_alltoall_z_ ()
#11 0x00000000004ec55f in redis_pw_ ()
#12 0x00000000005643d0 in choleski_mp_orthch_ ()
#13 0x000000000043fbba in MAIN__ ()
#14 0x000000000042f15c in main ()

On Tue, 2010-07-27 at 06:14 -0400, Terry Dontje wrote:
  
      
A clarification from your previous email, you had your code working
with OMPI 1.4.1 but an older version of OFED?  Then you upgraded to
OFED 1.4 and things stopped working?  Sounds like your current system
is set up with OMPI 1.4.2 and OFED 1.5.  Anyways, I am a little
confused as to when things might have actually broke.

My first guess would be something may be wrong with the OFED setup.
Have checked the status of your ib devices with ibv_devinfo?  Have you
ran any of the OFED rc tests like ibv_rc_pingpong?  

If the above seems ok have you tried to run a simpler OMPI test like
connectivity?  I would see if a simple np=2 run spanning across two
nodes fails?

What OS distribution and version you are running on?

--td
Brian Smith wrote: 
    
        
In case my previous e-mail is too vague for anyone to address, here's a
backtrace from my application.  This version, compiled with Intel
11.1.064 (OpenMPI 1.4.2 w/ gcc 4.4.2), hangs during MPI_Alltoall
instead.  Running on 16 CPUs, Opteron 2427, Mellanox Technologies
MT25418 w/ OFED 1.5

strace on all ranks repeatedly shows:
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
events=POLLIN}, {fd=7, events=POLLIN}, {fd=10, events=POLLIN}, {fd=22,
events=POLLIN}, {fd=23, events=POLLIN}], 7, 0) = 0 (Timeout)
...

gdb --pid=<pid>
(gdb) bt
#0  sm_fifo_read () at btl_sm.h:267
#1  mca_btl_sm_component_progress () at btl_sm_component.c:391
#2  0x00002b00085116ea in opal_progress () at
runtime/opal_progress.c:207
#3  0x00002b0007def215 in opal_condition_wait (count=2,
requests=0x7fffd27802a0, statuses=0x7fffd2780270)
    at ../opal/threads/condition.h:99
#4  ompi_request_default_wait_all (count=2, requests=0x7fffd27802a0,
statuses=0x7fffd2780270) at request/req_wait.c:262
#5  0x00002b0007e805b7 in ompi_coll_tuned_sendrecv_actual
(sendbuf=0x2aaac2c4c210, scount=28000, 
    sdatatype=0x2b0008198ea0, dest=6, stag=-13, recvbuf=<value optimized
out>, rcount=28000, rdatatype=0x2b0008198ea0, 
    source=10, rtag=-13, comm=0x16ad7420, status=0x0) at
coll_tuned_util.c:55
#6  0x00002b0007e8705f in ompi_coll_tuned_sendrecv (sbuf=0x2aaac2b04010,
scount=28000, sdtype=0x2b0008198ea0, 
    rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b0008198ea0,
comm=0x16ad7420, module=0x16ad8450)
    at coll_tuned_util.h:60
#7  ompi_coll_tuned_alltoall_intra_pairwise (sbuf=0x2aaac2b04010,
scount=28000, sdtype=0x2b0008198ea0, 
    rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b0008198ea0,
comm=0x16ad7420, module=0x16ad8450)
    at coll_tuned_alltoall.c:70
#8  0x00002b0007e0a71f in PMPI_Alltoall (sendbuf=0x2aaac2b04010,
sendcount=28000, sendtype=0x2b0008198ea0, 
    recvbuf=0x2aaac99a2010, recvcount=28000, recvtype=0x2b0008198ea0,
comm=0x16ad7420) at palltoall.c:84
#9  0x00002b0007b8bc86 in mpi_alltoall_f (sendbuf=0x2aaac2b04010 "",
sendcount=0x7fffd27806a0, 
    sendtype=<value optimized out>, 
    recvbuf=0x2aaac99a2010 "6%\177e\373\354\306>\346\226z\262\347\350
\260>\032ya(\303\003\272\276\231\343\322\363zjþ\230\247i\232\307PԾ(\304
\373\321D\261ľ\204֜Εh־H\266H\342l2\245\276\231C7]\003\250Ǿ`\277\231\272
\265E\261>j\213ѓ\370\002\263>НØx.\254>}\332-\313\371\326\320>\346\245f
\304\f\214\262\276\070\222zf#'\321>\024\066̆\026\227ɾ.T\277\266}\366
\270>h|\323L\330\fƾ^z\214!q*\277\276pQ?O\346\067\270>~\006\300",
recvcount=0x7fffd27806a4, recvtype=0xb67490, 
    comm=0x12d9ba0, ierr=0x7fffd27806a8) at palltoall_f.c:76
#10 0x00000000004634cc in m_sumf_d_ ()
#11 0x0000000000463072 in m_sum_z_ ()
#12 0x00000000004c8a8b in mrg_grid_rc_ ()
#13 0x00000000004ffc5e in rhosym_ ()
#14 0x0000000000610dc6 in us_mp_set_charge_ ()
#15 0x0000000000771c43 in elmin_ ()
#16 0x0000000000453853 in MAIN__ ()
#17 0x000000000042f15c in main ()

On other processes:

(gdb) bt
#0  0x0000003692a0b725 in pthread_spin_lock ()
from /lib64/libpthread.so.0
#1  0x00002aaaaacdfa7b in ibv_cmd_create_qp ()
from /usr/lib64/libmlx4-rdmav2.so
#2  0x00002b9dc1db3ff8 in progress_one_device ()
at /usr/include/infiniband/verbs.h:884
#3  btl_openib_component_progress () at btl_openib_component.c:3451
#4  0x00002b9dc24736ea in opal_progress () at
runtime/opal_progress.c:207
#5  0x00002b9dc1d51215 in opal_condition_wait (count=2,
requests=0x7fffece3cc20, statuses=0x7fffece3cbf0)
    at ../opal/threads/condition.h:99
#6  ompi_request_default_wait_all (count=2, requests=0x7fffece3cc20,
statuses=0x7fffece3cbf0) at request/req_wait.c:262
#7  0x00002b9dc1de25b7 in ompi_coll_tuned_sendrecv_actual
(sendbuf=0x2aaac2c4c210, scount=28000, 
    sdatatype=0x2b9dc20faea0, dest=6, stag=-13, recvbuf=<value optimized
out>, rcount=28000, rdatatype=0x2b9dc20faea0, 
    source=10, rtag=-13, comm=0x1745b420, status=0x0) at
coll_tuned_util.c:55
#8  0x00002b9dc1de905f in ompi_coll_tuned_sendrecv (sbuf=0x2aaac2b04010,
scount=28000, sdtype=0x2b9dc20faea0, 
    rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b9dc20faea0,
comm=0x1745b420, module=0x1745c450)
    at coll_tuned_util.h:60
#9  ompi_coll_tuned_alltoall_intra_pairwise (sbuf=0x2aaac2b04010,
scount=28000, sdtype=0x2b9dc20faea0, 
    rbuf=0x2aaac99a2010, rcount=28000, rdtype=0x2b9dc20faea0,
comm=0x1745b420, module=0x1745c450)
    at coll_tuned_alltoall.c:70
#10 0x00002b9dc1d6c71f in PMPI_Alltoall (sendbuf=0x2aaac2b04010,
sendcount=28000, sendtype=0x2b9dc20faea0, 
    recvbuf=0x2aaac99a2010, recvcount=28000, recvtype=0x2b9dc20faea0,
comm=0x1745b420) at palltoall.c:84
#11 0x00002b9dc1aedc86 in mpi_alltoall_f (sendbuf=0x2aaac2b04010 "",
sendcount=0x7fffece3d020, 
    sendtype=<value optimized out>, 
    recvbuf=0x2aaac99a2010 "6%\177e\373\354\306>\346\226z\262\347\350
\260>\032ya(\303\003\272\276\231\343\322\363zjþ\230\247i\232\307PԾ(\304
\373\321D\261ľ\204֜Εh־H\266H\342l2\245\276\231C7]\003\250Ǿ`\277\231\272
\265E\261>j\213ѓ\370\002\263>НØx.\254>}\332-\313\371\326\320>\346\245f
\304\f\214\262\276\070\222zf#'\321>\024\066̆\026\227ɾ.T\277\266}\366
\270>h|\323L\330\fƾ^z\214!q*\277\276pQ?O\346\067\270>~\006\300",
recvcount=0x7fffece3d024, recvtype=0xb67490, 
    comm=0x12d9ba0, ierr=0x7fffece3d028) at palltoall_f.c:76
#12 0x00000000004634cc in m_sumf_d_ ()
#13 0x0000000000463072 in m_sum_z_ ()
#14 0x00000000004c8a8b in mrg_grid_rc_ ()
#15 0x00000000004ffc5e in rhosym_ ()
#16 0x0000000000610dc6 in us_mp_set_charge_ ()
#17 0x0000000000771c43 in elmin_ ()
#18 0x0000000000453853 in MAIN__ ()
#19 0x000000000042f15c in main ()


I set up padb to collect a full report on the process and I've attached
it to this message.  Let me know if I can provide anything further.

Thanks,
-Brian



On Wed, 2010-07-21 at 10:07 -0400, Brian Smith wrote:
  
      
          
Hi, All,

A couple of applications that I'm using -- VASP and Charmm -- end up
"stuck" (for lack of a better word) during a waitall call after some
non-blocking send/recv action.  This only happens when utilizing the
openib btl.  I've followed a couple of bugs where this seemed to happen
in some previous revisions and tried the work-arounds provided, but to
no avail.  I'm going to try running against a previous version to see if
its a regression of some sort, but this problem didn't seem to exist in
1.4.1 until our systems were updated to OFED >= 1.4.  Any suggestions
besides the obvious, "well, down-grade from >= 1.4"?  What additional
info can I provide to help?

Thanks,
-Brian

    
        
            
____________________________________________________________________

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
      
          
-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com


    
        
  
      
-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com


    


  


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje@oracle.com