
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] problem with progress thread and orte
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-01-12 13:28:05


Without progress threads, you can only receive messages when you call a function in the OMPI library (e.g., when you send something). In addition, you only receive -one- message for each time you call into the library.

With progress threads, you receive messages as they arrive, even if you aren't in the OMPI library at that time...but this doesn't work right now.

On Jan 11, 2010, at 10:31 PM, Sangamesh B wrote:

> Hi,
>
> What are the advantages with progress-threads feature?
>
> Thanks,
> Sangamesh
>
> On Fri, Jan 8, 2010 at 10:13 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> Yeah, the system doesn't currently support enable-progress-threads. It is a two-fold problem: ORTE won't work that way, and some parts of the MPI layer won't either.
>
> I am currently working on fixing ORTE so it will work with progress threads enabled. I believe (but can't confirm) that the TCP BTL will also work with that feature, but I have heard that the other BTLs won't (again, can't confirm).
>
> I'll send out a note when ORTE is okay, but that won't be included in a release for a while.
>
> On Jan 8, 2010, at 9:38 AM, Dong Li wrote:
>
> > Hi, guys.
> > My application got stuck when I ran it with Open MPI 1.4
> > with progress threads enabled.
> >
> > Open MPI was configured and compiled with the following options:
> > ./configure --with-openib=/usr --enable-trace --enable-debug
> > --enable-peruse --enable-progress-threads
> >
> > Then I started the application with two MPI processes, but it looks
> > like there is some problem with orte: mpiexec just gets stuck
> > and never runs the application.
> > I used gdb to attach to mpiexec to find out where the program got
> > stuck. The backtrace information for the two MPI processes (i.e.,
> > rank 0 and rank 1) is shown below. It looks to me like the problem
> > happens in rank 0 when it tries to do an atomic add
> > operation. Note that my processor is an Intel Xeon CPU E5462, but
> > Open MPI tried to use some AMD64 instructions to conduct the atomic
> > add operation. Is this a bug or something?
> >
> > Any comment? Thank you.
> >
> > -Dong
> >
> >
> > ***********************************************************************************************************************************************
> > The following is for the rank 0.
> > (gdb) bt
> > #0 0x00007fbdd1c93264 in opal_atomic_cmpset_32 (addr=0x7fbdd1eede24,
> > oldval=1, newval=0) at ../opal/include/opal/sys/amd64/atomic.h:94
> > #1 0x00007fbdd1c93348 in opal_atomic_add_xx (addr=0x7fbdd1eede24,
> > value=1, length=4) at ../opal/include/opal/sys/atomic_impl.h:243
> > #2 0x00007fbdd1c932ad in opal_progress () at runtime/opal_progress.c:171
> > #3 0x00007fbdd1f5c9ad in orte_plm_base_daemon_callback
> > (num_daemons=1) at base/plm_base_launch_support.c:459
> > #4 0x00007fbdd0a5579d in orte_plm_rsh_launch (jdata=0x60f070) at
> > plm_rsh_module.c:1221
> > #5 0x0000000000403821 in orterun (argc=15, argv=0x7fffda18a498) at
> > orterun.c:748
> > #6 0x0000000000402dc7 in main (argc=15, argv=0x7fffda18a498) at main.c:13
> > ************************************************************************************************************************************************
> > The following is for the rank 1.
> > #0 0x0000003c4c20b309 in pthread_cond_wait@@GLIBC_2.3.2 () from
> > /lib64/libpthread.so.0
> > #1 0x00007f6f8b04ba56 in opal_condition_wait (c=0x656ce0, m=0x656c88)
> > at ../../../../opal/threads/condition.h:78
> > #2 0x00007f6f8b04b8b7 in orte_rml_oob_send (peer=0x7f6f8c578978,
> > iov=0x7fff945798d0, count=1, tag=10, flags=16) at rml_oob_send.c:153
> > #3 0x00007f6f8b04c197 in orte_rml_oob_send_buffer
> > (peer=0x7f6f8c578978, buffer=0x6563b0, tag=10, flags=0) at
> > rml_oob_send.c:269
> > #4 0x00007f6f8c32fe24 in orte_daemon (argc=28, argv=0x7fff9457abd8)
> > at orted/orted_main.c:610
> > #5 0x0000000000400917 in main (argc=28, argv=0x7fff9457abd8) at orted.c:62
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>