Open MPI User's Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-03-15 11:04:07


I can't speak to the MPI problems mentioned in here as my area of focus is
solely on the RTE. With that caveat, I can say that - despite the fact there
is little thread safety testing in the system - I haven't heard of any
trouble launching non-MPI apps. We do it regularly, in both threaded and
non-threaded builds, on a wide variety of clusters and SMPs... although I
confess that I personally build with --disable-progress-threads and the other
threading options "off" given the state of thread safety testing.

That said, there are several known problems in the 1.1.x code series that
can cause the system to "hang". For example, if the system is unable to
locate the specified application on the remote node, or lacks permission to
execute it there, and the rsh launcher is being used, then mpirun can hang
in the way you describe.
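
One way to rule out that particular cause before mpirun gets involved is a
small pre-flight check for a missing or non-executable application. The
following is only a sketch: can_launch is a hypothetical helper, not part of
Open MPI, and it only inspects the node it runs on, so it would have to be
run on each remote node (for example over the same rsh/ssh path the launcher
uses).

```python
import os
import shutil


def can_launch(app, path=None):
    """Return the resolved path of `app` if it can be found and executed.

    Hypothetical helper for illustration only; not an Open MPI API.
    `app` may be a bare command name (searched on PATH, or on `path`
    if one is given) or a path containing a directory component.
    Returns None when the file is missing or lacks execute permission.
    """
    return shutil.which(app, mode=os.F_OK | os.X_OK, path=path)
```

Calling can_launch("hostname") on a node returns the full path if the command
can be run there and None otherwise. A None result points at the
locate/permissions case rather than at threading.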

We have made considerable improvement in that regard in the 1.2 release that
is expected out momentarily. I've been told that there are no plans to
provide any more bug fixes for the 1.1 code series - it will basically end
with the upcoming 1.1.5 release, which does *not* contain fixes for problems
such as the example I described.

If you can, therefore, I would suggest upgrading to the 1.2 release (the
final release candidate is on the site - the official release looks like it
will be identical to that candidate).

I'll have to let the team members who focus on the MPI layer address the
other problems you mentioned.

Ralph

On 3/13/07 4:31 AM, "David Minor" <david-m_at_[hidden]> wrote:

> Sounds like bad news about the threading. That's probably what's hanging me as
> well. We're running clusters of multi-core SMPs, and our app NEEDS
> multi-threading. It'd be nice to get an "official" reply on this from someone
> on the dev team.
> -David
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf
> Of Mike Houston
> Sent: Tuesday, March 13, 2007 5:52 AM
> To: Open MPI Users
> Subject: [OMPI users] Fun with threading
>
> At least with 1.1.4, I'm having a heck of a time with enabling
> multi-threading. Configuring with --with-threads=posix
> --enable-mpi-threads --enable-progress-threads leads to mpirun just
> hanging, even when not launching MPI apps, i.e. mpirun -np 1 hostname,
> and I can't ctrl-c to kill it; I have to kill -9 it. Removing progress
> threads support results in the same behavior. Removing
> --enable-mpi-threads gets mpirun working again, but without the thread
> protection I need.
>
> What is the status for multi thread support? It looks like it's still
> largely untested from my reading of the mailing lists. We actually have
> an application that would be much easier to deal with if we could have
> two threads in a process both using MPI. Funneling everything through a
> single processor creates a locking nightmare, and generally means we
> will be forced to spin checking an MPI_Irecv and the status of a data
> structure instead of having one thread happily sitting on a blocking
> receive and the other watching the data structure, basically pissing
> away a processor that we could be using to do something useful. (We are
> basically doing a simplified version of DSM and we need to respond to
> remote data requests).
>
> At the moment, it seems that when running without threading support
> enabled, if we only post a receive on a single thread, things are mostly
> happy, except when one thread in a process sends to the other thread in
> the same process, which has posted a receive. Under TCP, the send fails with:
>
> *** An error occurred in MPI_Send
> *** on communicator MPI_COMM_WORLD
> *** MPI_ERR_INTERN: internal error
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> [0,0,0]-[0,1,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>
> SM has undefined results.
>
> Obviously I'm playing fast and loose, which is why I'm attempting to get
> threading support to work to see if it solves the headaches. If you
> really want to have some fun, have a posted MPI_Recv on one thread and
> issue an MPI_Barrier on the other (with SM):
>
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x1c
> [0] func:/usr/lib/libopal.so.0 [0xc030f4]
> [1] func:/lib/tls/libpthread.so.0 [0x46f93890]
> [2] func:/usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_match+0xb08) [0x14ec38]
> [3] func:/usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback+0x2f9) [0x14f7e9]
> [4] func:/usr/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0xa87) [0x806c07]
> [5] func:/usr/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x39) [0x510c69]
> [6] func:/usr/lib/libopal.so.0(opal_progress+0x69) [0xbecc39]
> [7] func:/usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x785) [0x14d675]
> [8] func:/usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual_localcompleted+0x8c) [0x5cc3fc]
> [9] func:/usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_two_procs+0x76) [0x5ceef6]
> [10] func:/usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x38) [0x5cc638]
> [11] func:/usr/lib/libmpi.so.0(PMPI_Barrier+0xe9) [0x29a1b9]
>
> -Mike
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users