Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-09-11 11:28:21


On Tue, Sep 11, 2007 at 10:00:07AM -0500, Edgar Gabriel wrote:
> Gleb,
>
> in the scenario which you describe in the comment to the patch, what
> should happen is, that the communicator with the cid which started
> already the allreduce will basically 'hang' until the other processes
> 'allow' the lower cids to continue. It should basically be blocked in
> the allreduce.
Why? Two threads are allowed to run allreduce simultaneously for different
communicators. Are they?

>
> However, here is something, where we might have problems with the sun
> thread tests (and we discussed that with Terry already): the cid
> allocation algorithm as implemented in Open MPI assumes ( -- this was/is
> my/our understanding of the standard --) that a communicator creation is
> a collective operation. This means, you can not have a comm_create and
> another allreduce of the same communicator running in different threads,
> because these allreduces will mix up and produce non-sense results. We
> fixed the case, if all collective operations are comm_creates, but if
> some of the threads are in a comm_create and some are in allreduce on
> the same communicator, it won't work.
Correct, but this is not what happens with mt_coll test. mt_coll calls
commdup on the same communicator in different threads concurrently, but
we handle this case inside ompi_comm_nextcid().

>
>
> Gleb Natapov wrote:
> > On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote:
> >> Gleb,
> >>
> >> This patch is not correct. The code preventing the registration of the same
> >> communicator twice is later in the code (same file in the function
> >> ompi_comm_register_cid line 326). Once the function ompi_comm_register_cid
> > I saw this code and the comment. The problem is not with the same
> > communicator but with different communicators.
> >
> >> is called, we know that each communicator only handle one "communicator
> >> creation" function at the same time. Therefore, we want to give priority to
> >> the smallest com_id, which is what happens in the code you removed.
> > The code I removed was doing it wrongly. I.e the algorithm sometimes is executed
> > for different communicators simultaneously by different threads. Think
> > about the case where the function is running for cid 1 and then another
> > thread runs it for cid 0. cid 0 will proceed although the function is
> > executed on another CPU. And this is not something theoretical, that
> > is happening with sun's thread test suit mpi_coll test case.
> >
> >> Without the condition in the ompi_comm_register_cid (each communicator only
> >> get registered once) your comment make sense. However, with the condition
> >> your patch allow a dead end situation, while 2 processes try to create
> >> communicators in multiple threads, and they will never succeed, simply
> >> because they will not order the creation based on the com_id.
> > If the algorithm is really prone to deadlock in case it is concurrently
> > executed for several different communicators (I haven't check this),
> > then we may want to fix original code to really prevent two threads to
> > enter the function, but then I don't see the reason for all those
> > complications with ompi_comm_register_cid()/ompi_comm_unregister_cid()
> > The algorithm described here:
> > http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm&hl=en&ct=clnk&cd=2
> > in section 5.3 works without it and we can do something similar.
> >
> >> george.
> >>
> >>
> >>
> >> On Sep 11, 2007, at 9:23 AM, gleb_at_[hidden] wrote:
> >>
> >>> Author: gleb
> >>> Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
> >>> New Revision: 16088
> >>> URL: https://svn.open-mpi.org/trac/ompi/changeset/16088
> >>>
> >>> Log:
> >>> The code tries to prevent itself from running for more then one
> >>> communicator
> >>> simultaneously, but is doing it incorrectly. If the function is running
> >>> already
> >>> for one communicator and it is called from another thread for other
> >>> communicator
> >>> with lower cid the check comm->c_contextid != ompi_comm_lowest_cid()
> >>> will fail and the function will be executed for two different
> >>> communicators by
> >>> two threads simultaneously. There is nothing in the algorithm that prevent
> >>> it
> >>> from been running simultaneously for different communicators as far as I
> >>> can see,
> >>> but ompi_comm_unregister_cid() assumes that it is always called for a
> >>> communicator
> >>> with the lowest cid and this is not always the case. This patch removes
> >>> bogus
> >>> lowest cid check and fix ompi_comm_register_cid() to properly remove cid
> >>> from
> >>> the list.
> >>>
> >>> Text files modified:
> >>> trunk/ompi/communicator/comm_cid.c | 24 ++++++++++++------------
> >>> 1 files changed, 12 insertions(+), 12 deletions(-)
> >>>
> >>> Modified: trunk/ompi/communicator/comm_cid.c
> >>> ==============================================================================
> >>> --- trunk/ompi/communicator/comm_cid.c (original)
> >>> +++ trunk/ompi/communicator/comm_cid.c 2007-09-11 09:23:46 EDT (Tue, 11
> >>> Sep 2007)
> >>> @@ -11,6 +11,7 @@
> >>> * All rights reserved.
> >>> * Copyright (c) 2006-2007 University of Houston. All rights reserved.
> >>> * Copyright (c) 2007 Cisco, Inc. All rights reserved.
> >>> + * Copyright (c) 2007 Voltaire All rights reserved.
> >>> * $COPYRIGHT$
> >>> *
> >>> * Additional copyrights may follow
> >>> @@ -170,15 +171,6 @@
> >>> * This is the real algorithm described in the doc
> >>> */
> >>>
> >>> - OPAL_THREAD_LOCK(&ompi_cid_lock);
> >>> - if (comm->c_contextid != ompi_comm_lowest_cid() ) {
> >>> - /* if not lowest cid, we do not continue, but sleep and
> >>> try again */
> >>> - OPAL_THREAD_UNLOCK(&ompi_cid_lock);
> >>> - continue;
> >>> - }
> >>> - OPAL_THREAD_UNLOCK(&ompi_cid_lock);
> >>> -
> >>> -
> >>> for (i=start; i < mca_pml.pml_max_contextid ; i++) {
> >>>
> >>> flag=ompi_pointer_array_test_and_set_item(&ompi_mpi_communicators,
> >>> i, comm);
> >>> @@ -365,10 +357,18 @@
> >>>
> >>> static int ompi_comm_unregister_cid (uint32_t cid)
> >>> {
> >>> - ompi_comm_reg_t *regcom=NULL;
> >>> - opal_list_item_t
> >>> *item=opal_list_remove_first(&ompi_registered_comms);
> >>> + ompi_comm_reg_t *regcom;
> >>> + opal_list_item_t *item;
> >>>
> >>> - regcom = (ompi_comm_reg_t *) item;
> >>> + for (item = opal_list_get_first(&ompi_registered_comms);
> >>> + item != opal_list_get_end(&ompi_registered_comms);
> >>> + item = opal_list_get_next(item)) {
> >>> + regcom = (ompi_comm_reg_t *)item;
> >>> + if(regcom->cid == cid) {
> >>> + opal_list_remove_item(&ompi_registered_comms, item);
> >>> + break;
> >>> + }
> >>> + }
> >>> OBJ_RELEASE(regcom);
> >>> return OMPI_SUCCESS;
> >>> }
> >>> _______________________________________________
> >>> svn-full mailing list
> >>> svn-full_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/svn-full
> >
> >
> >
> >> _______________________________________________
> >> devel mailing list
> >> devel_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > --
> > Gleb.
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab http://pstl.cs.uh.edu
> Department of Computer Science University of Houston
> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
			Gleb.