Subject: [OMPI users] problem with progress thread and orte
From: Dong Li (lid_at_[hidden])
Date: 2010-01-08 11:38:21

Hi, guys.
My application hangs when I run it with Open MPI 1.4
with progress threads enabled.

Open MPI was configured and compiled with the following options:
./configure --with-openib=/usr --enable-trace --enable-debug
--enable-peruse --enable-progress-threads

Then I started the application with two MPI processes, but there seems
to be a problem with ORTE: mpiexec just hangs and never runs the
application.
I attached gdb to mpiexec to find out where the program got stuck. The
backtraces are shown below for the two MPI processes (i.e. rank 0 and
rank 1). It looks to me as if the problem occurs in rank 0 when it
tries to perform an atomic add operation. Note that my processor is an
Intel Xeon E5462, but Open MPI appears to use AMD64 code (see
opal/sys/amd64/atomic.h in frame #0) for the atomic add operation.
Is this a bug?

Any comments? Thank you.


The following is for rank 0.
(gdb) bt
#0 0x00007fbdd1c93264 in opal_atomic_cmpset_32 (addr=0x7fbdd1eede24,
oldval=1, newval=0) at ../opal/include/opal/sys/amd64/atomic.h:94
#1 0x00007fbdd1c93348 in opal_atomic_add_xx (addr=0x7fbdd1eede24,
value=1, length=4) at ../opal/include/opal/sys/atomic_impl.h:243
#2 0x00007fbdd1c932ad in opal_progress () at runtime/opal_progress.c:171
#3 0x00007fbdd1f5c9ad in orte_plm_base_daemon_callback
(num_daemons=1) at base/plm_base_launch_support.c:459
#4 0x00007fbdd0a5579d in orte_plm_rsh_launch (jdata=0x60f070) at
#5 0x0000000000403821 in orterun (argc=15, argv=0x7fffda18a498) at
#6 0x0000000000402dc7 in main (argc=15, argv=0x7fffda18a498) at main.c:13

The following is for rank 1.
#0 0x0000003c4c20b309 in pthread_cond_wait@@GLIBC_2.3.2 () from
#1 0x00007f6f8b04ba56 in opal_condition_wait (c=0x656ce0, m=0x656c88)
at ../../../../opal/threads/condition.h:78
#2 0x00007f6f8b04b8b7 in orte_rml_oob_send (peer=0x7f6f8c578978,
iov=0x7fff945798d0, count=1, tag=10, flags=16) at rml_oob_send.c:153
#3 0x00007f6f8b04c197 in orte_rml_oob_send_buffer
(peer=0x7f6f8c578978, buffer=0x6563b0, tag=10, flags=0) at
#4 0x00007f6f8c32fe24 in orte_daemon (argc=28, argv=0x7fff9457abd8)
at orted/orted_main.c:610
#5 0x0000000000400917 in main (argc=28, argv=0x7fff9457abd8) at orted.c:62
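
To make clearer what I think frames #0 and #1 of the rank 0 backtrace are doing: they suggest that the atomic add is built from a compare-and-set retry loop. Below is a minimal sketch of that general technique using C11 <stdatomic.h>. It is not the Open MPI source; the helper name cas_add_32 is hypothetical and only there for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (NOT Open MPI code): add `delta` to *addr using
 * only compare-and-swap, retrying until no other thread has modified
 * the value between our read and our write. Returns the new value. */
static int32_t cas_add_32(_Atomic int32_t *addr, int32_t delta)
{
    int32_t old = atomic_load(addr);

    /* On failure, atomic_compare_exchange_weak stores the current
     * contents of *addr into `old`, so the next iteration retries
     * with fresh data. */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
        /* another thread changed *addr (or a spurious failure); retry */
    }
    return old + delta;
}
```

A loop like this only keeps spinning while the compare-and-swap keeps failing, so a single gdb snapshot that lands inside the compare-and-set does not by itself show whether the atomic is where the program is really stuck.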