
Open MPI User's Mailing List Archives


Subject: [OMPI users] hang in mpi_allreduce in single linux machine
From: William Au (au_wai_chung_at_[hidden])
Date: 2012-06-27 14:59:22


Hi,

When I ran multiple processes on a single machine, the programs hung in
mpi_allreduce, at different points during different runs. I am using
Open MPI 1.3.4. When I used different machines to run the processes, it
was OK. Also, when I recompiled Open MPI in debug mode, the problem went
away. Since the hangs occurred at different points, I suspect a
race/deadlock caused by some optimization in Open MPI. I compiled with
-O3 using gcc44 and gfortran44. The software I am running is MUMPS
(4.10.0). Other platforms (Solaris 10) do not have this problem. Any
suggestions I should try out?

Here is the stack:

#0 mca_btl_sm_component_progress () at btl_sm_component.c:387
#1 0x00002b304a4e1f3a in opal_progress () at runtime/opal_progress.c:207
#2 0x00002b3049e20fa5 in opal_condition_wait (count=2, requests=0x7fff1376d850, statuses=0x0)
    at ../opal/threads/condition.h:99
#3 ompi_request_default_wait_all (count=2, requests=0x7fff1376d850, statuses=0x0)
    at request/req_wait.c:262
#4 0x00002b304ecb4952 in ompi_coll_tuned_allreduce_intra_recursivedoubling (
    sbuf=<value optimized out>, rbuf=0x14c9da10, count=1, dtype=0x2b304a085d40, op=0x2b304a07d280,
    comm=0x14ca34d0, module=0x14ca0500) at coll_tuned_allreduce.c:223
#5 0x00002b3049e36384 in PMPI_Allreduce (sendbuf=0x14c9d8d0, recvbuf=0x14c9da10, count=1,
    datatype=<value optimized out>, op=0x2b304a07d280, comm=0x14ca34d0) at pallreduce.c:102
#6 0x00002b304a0b9bd3 in mpi_allreduce_f (sendbuf=0x14c9d8d0 "", recvbuf=0x14c9da10 "",
    count=0x626eb0, datatype=<value optimized out>, op=0x626ec0, comm=<value optimized out>,
    ierr=0x7fff1376e530) at pallreduce_f.c:77
#7 0x000000000049dbd4 in dmumps_142 (id=...) at dmumps_part5.F:5570
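In case it helps to narrow this down, a minimal stress test along the
lines below (a 1-integer MPI_Allreduce in a tight loop, independent of
MUMPS; the program name and iteration count are just placeholders) should
exercise the same allreduce / sm BTL code path and show whether the hang
reproduces without MUMPS:

program allreduce_stress
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, nprocs, i
  integer :: sendval, recvval

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Many iterations of a 1-integer allreduce over MPI_COMM_WORLD;
  ! a race in the sm BTL would typically need many calls to show up.
  do i = 1, 100000
     sendval = rank + i
     call MPI_Allreduce(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                        MPI_COMM_WORLD, ierr)
  end do

  if (rank == 0) print *, 'completed all allreduce calls, last sum =', recvval

  call MPI_Finalize(ierr)
end program allreduce_stress

It can be built and launched like the MUMPS run (e.g. mpif90 -O3
allreduce_stress.f90 -o allreduce_stress; mpirun -np 4 ./allreduce_stress),
and then re-run with the shared-memory BTL excluded
(mpirun --mca btl ^sm -np 4 ./allreduce_stress) to see whether the sm
path is involved.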

Thanks.

William