Open MPI User's Mailing List Archives

Subject: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1
From: francoise.roch_at_[hidden]
Date: 2011-05-10 09:43:34


Hi,

I compiled a parallel program with OpenMPI 1.4.1 (itself built with the Intel
compilers 12 from the composerxe package). The program is linked against the
MUMPS library 4.9.2, built with the same compilers and linked with Intel
MKL. The OS is Debian Linux.
There are no errors when compiling or launching the job, but the program
freezes inside a call to the "zmumps" routine, when the slave processes call
the MPI_COMM_DUP routine.

The program runs on 2 nodes of 12 cores each (Westmere processors) with
the following command:

mpirun -np 24 --machinefile $OAR_NODE_FILE -mca plm_rsh_agent "oarsh"
--mca btl self,openib -x LD_LIBRARY_PATH ./prog

There are 12 processes running on each node. We submit the job with the OAR
batch scheduler (the $OAR_NODE_FILE variable and the "oarsh" command are
specific to this scheduler and usually work well with OpenMPI).

With gdb, we can see that the slaves are blocked in MPI_COMM_DUP:

(gdb) where
#0 0x00002b32c1533113 in poll () from /lib/libc.so.6
#1 0x0000000000adf52c in poll_dispatch ()
#2 0x0000000000adcea3 in opal_event_loop ()
#3 0x0000000000ad69f9 in opal_progress ()
#4 0x0000000000a34b4e in mca_pml_ob1_recv ()
#5 0x00000000009b0768 in ompi_coll_tuned_allreduce_intra_recursivedoubling ()
#6 0x00000000009ac829 in ompi_coll_tuned_allreduce_intra_dec_fixed ()
#7 0x000000000097e271 in ompi_comm_allreduce_intra ()
#8 0x000000000097dd06 in ompi_comm_nextcid ()
#9 0x000000000097be01 in ompi_comm_dup ()
#10 0x00000000009a0785 in PMPI_Comm_dup ()
#11 0x000000000097931d in pmpi_comm_dup__ ()
#12 0x0000000000644251 in zmumps (id=...) at zmumps_part1.F:144
#13 0x00000000004c0d03 in sub_pbdirect_init (id=..., matrix_build=...)
at sub_pbdirect_init.f90:44
#14 0x0000000000628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048

The master is waiting further on, in MPI_BARRIER:

(gdb) where
#0 0x00002b9dc9f3e113 in poll () from /lib/libc.so.6
#1 0x0000000000adf52c in poll_dispatch ()
#2 0x0000000000adcea3 in opal_event_loop ()
#3 0x0000000000ad69f9 in opal_progress ()
#4 0x000000000098f294 in ompi_request_default_wait_all ()
#5 0x0000000000a06e56 in ompi_coll_tuned_sendrecv_actual ()
#6 0x00000000009ab8e3 in ompi_coll_tuned_barrier_intra_bruck ()
#7 0x00000000009ac926 in ompi_coll_tuned_barrier_intra_dec_fixed ()
#8 0x00000000009a0b20 in PMPI_Barrier ()
#9 0x0000000000978c93 in pmpi_barrier__ ()
#10 0x00000000004c0dc4 in sub_pbdirect_init (id=..., matrix_build=...)
at sub_pbdirect_init.f90:62
#11 0x0000000000628706 in fwt2d_elas_v2 () at fwt2d_elas.f90:1048
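
To compare such backtraces across many ranks, a small script can pull out the MPI entry point each process is stuck in. The following is only a minimal sketch, not something used in the original report: the helper names `parse_frames` and `blocking_mpi_call` are hypothetical, the traces are abbreviated excerpts of the two above, and it assumes that the C entry points show up as `PMPI_*` frames as they do here.

```python
import re
from typing import List, Optional

# Abbreviated excerpts of the two gdb backtraces quoted above.
SLAVE_TRACE = """
#0 0x00002b32c1533113 in poll () from /lib/libc.so.6
#5 0x00000000009b0768 in
ompi_coll_tuned_allreduce_intra_recursivedoubling ()
#8 0x000000000097dd06 in ompi_comm_nextcid ()
#9 0x000000000097be01 in ompi_comm_dup ()
#10 0x00000000009a0785 in PMPI_Comm_dup ()
#12 0x0000000000644251 in zmumps (id=...) at zmumps_part1.F:144
"""

MASTER_TRACE = """
#0 0x00002b9dc9f3e113 in poll () from /lib/libc.so.6
#6 0x00000000009ab8e3 in ompi_coll_tuned_barrier_intra_bruck ()
#8 0x00000000009a0b20 in PMPI_Barrier ()
#10 0x00000000004c0dc4 in sub_pbdirect_init (id=..., matrix_build=...)
at sub_pbdirect_init.f90:62
"""

# "#<n> [0x<addr> in ]<function>" at the start of a gdb frame line.
FRAME_RE = re.compile(r"#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\w+)")


def parse_frames(backtrace: str) -> List[str]:
    """Return the function names of a gdb backtrace, innermost frame first."""
    frames: List[str] = []
    for line in backtrace.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            frames.append(line)
        elif frames:
            # A line without a frame number continues a wrapped frame.
            frames[-1] += " " + line
    names = []
    for frame in frames:
        match = FRAME_RE.match(frame)
        if match:
            names.append(match.group(1))
    return names


def blocking_mpi_call(backtrace: str) -> Optional[str]:
    """Return the innermost PMPI_* frame, i.e. the MPI call the rank sits in."""
    for name in parse_frames(backtrace):
        if name.startswith("PMPI_"):
            return name
    return None


if __name__ == "__main__":
    print("slave :", blocking_mpi_call(SLAVE_TRACE))
    print("master:", blocking_mpi_call(MASTER_TRACE))
```

On the excerpts above this reports `PMPI_Comm_dup` for the slave and `PMPI_Barrier` for the master, which matches what the traces show: the two groups are waiting in different collective calls.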

Remark:
The same code compiles and runs correctly with the Intel MPI library, from
the same Intel package, on the same nodes.

Thanks for any help

Françoise Roch