Open MPI Development Mailing List Archives

From: Richard Graham (rlgraham_at_[hidden])
Date: 2007-06-12 09:45:55


We should not pretend that threads work in the 1.2 code branch. Thread
safety has been designed in, but we are just kicking off an effort to
complete and verify the thread safety.

Rich

On 6/11/07 2:49 PM, "Paul H. Hargrove" <PHHargrove_at_[hidden]> wrote:

> If Jeff has the resources to run threaded tests against 1.2, *and* to
> examine the results, then it might be valuable to have a summary of the
> known threading issues in 1.2 written down somewhere for the benefit of
> those who don't chase the trunk.
>
> -Paul
>
> Graham, Richard L. wrote:
>> > I would second this - thread safety should be a 1.3 item, unless someone
>> > has a lot of spare time.
>> >
>> > Rich
>> >
>> > -----Original Message-----
>> > From: devel-bounces_at_[hidden] <devel-bounces_at_[hidden]>
>> > To: Open MPI Developers <devel_at_[hidden]>
>> > Sent: Mon Jun 11 10:44:33 2007
>> > Subject: Re: [OMPI devel] threaded builds
>> >
>> >
>> > On Jun 11, 2007, at 8:25 AM, Jeff Squyres wrote:
>> >
>> >
>>> >> I leave it to the thread subgroup to decide... Should we discuss on
>>> >> the call tomorrow?
>>> >>
>>> >> I don't have a strong opinion; I was just testing both because it was
>>> >> easy to do so. If we want to concentrate on the trunk, I can adjust
>>> >> my MTT setup.
>>> >>
>>> >>
>> >
>> > I think trying to worry about 1.2 would just be a time sink. We know
>> > that there are architectural issues with threads in some parts of the
>> > code. I don't see us re-architecting 1.2 in this regard.
>> > Seems we should only focus on the trunk.
>> >
>> >
>> > - Galen
>> >
>> >
>> >
>>> >> On Jun 11, 2007, at 10:17 AM, Brian Barrett wrote:
>>> >>
>>> >>
>>>> >>> Yes, this is a known issue. I don't know -- are we trying to make
>>>> >>> threads work on the 1.2 branch, or just the trunk? I had thought
>>>> >>> just the trunk?
>>>> >>>
>>>> >>> Brian
>>>> >>>
>>>> >>>
>>>> >>> On Jun 11, 2007, at 8:13 AM, Tim Prins wrote:
>>>> >>>
>>>> >>>
>>>>> >>>> I had similar problems on the trunk, which was fixed by Brian with
>>>>> >>>> r14877.
>>>>> >>>>
>>>>> >>>> Perhaps 1.2 needs something similar?
>>>>> >>>>
>>>>> >>>> Tim
>>>>> >>>>
>>>>> >>>> On Monday 11 June 2007 10:08:15 am Jeff Squyres wrote:
>>>>> >>>>
>>>>>> >>>>> Per the teleconf last week, I have started to revamp the Cisco MTT
>>>>>> >>>>> infrastructure to do simplistic thread testing. Specifically, I'm
>>>>>> >>>>> building the OMPI trunk and v1.2 branches with "--with-threads
>>>>>> >>>>> --enable-mpi-threads".
>>>>>> >>>>>
>>>>>> >>>>> I haven't switched this into my production MTT setup yet, but in
>>>>>> >>>>> the first trial runs, I'm noticing a segv in the
>>>>>> >>>>> test/threads/opal_condition program.
>>>>>> >>>>>
>>>>>> >>>>> It seems that in the thr1 test on the v1.2 branch, when it calls
>>>>>> >>>>> opal_progress() underneath the condition variable wait, at some
>>>>>> >>>>> point in there current_base is getting to be NULL. Hence, the
>>>>>> >>>>> following segv's because the passed in value of "base" is NULL
>>>>>> >>>>> (event.c):
>>>>>> >>>>>
>>>>>> >>>>> int
>>>>>> >>>>> opal_event_base_loop(struct event_base *base, int flags)
>>>>>> >>>>> {
>>>>>> >>>>>     const struct opal_eventop *evsel = base->evsel;
>>>>>> >>>>>     ...
>>>>>> >>>>>
>>>>>> >>>>> Here's the full call stack:
>>>>>> >>>>>
>>>>>> >>>>> #0 0x0000002a955a020e in opal_event_base_loop (base=0x0, flags=5) at event.c:520
>>>>>> >>>>> #1 0x0000002a955a01f9 in opal_event_loop (flags=5) at event.c:514
>>>>>> >>>>> #2 0x0000002a95599111 in opal_progress () at runtime/opal_progress.c:259
>>>>>> >>>>> #3 0x00000000004012c8 in opal_condition_wait (c=0x5025a0, m=0x502600) at ../../opal/threads/condition.h:81
>>>>>> >>>>> #4 0x0000000000401146 in thr1_run (obj=0x503110) at opal_condition.c:46
>>>>>> >>>>> #5 0x00000036e290610a in start_thread () from /lib64/tls/libpthread.so.0
>>>>>> >>>>> #6 0x00000036e1ec68c3 in clone () from /lib64/tls/libc.so.6
>>>>>> >>>>> #7 0x0000000000000000 in ?? ()
>>>>>> >>>>>
>>>>>> >>>>> This test seems to work fine on the trunk (at least, it didn't segv
>>>>>> >>>>> in my small number of trial runs).
>>>>>> >>>>>
>>>>>> >>>>> Is this a known problem in the 1.2 branch? Should I skip the thread
>>>>>> >>>>> testing on the 1.2 branch and concentrate on the trunk?
>>>>>> >>>>>
>>>>> >>>> _______________________________________________
>>>>> >>>> devel mailing list
>>>>> >>>> devel_at_[hidden]
>>>>> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> >>>>
>>> >> --
>>> >> Jeff Squyres
>>> >> Cisco Systems
>>> >>
>> >
>> >
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> HPC Research Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
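
For readers following the crash report quoted above: frame #0 shows
opal_event_base_loop() being entered with base=0x0, so its first statement,
base->evsel, dereferences a NULL pointer. Below is a minimal sketch in C of
that failure mode with a defensive check added. The demo_* types and
functions are hypothetical stand-ins invented for illustration, not the real
OPAL definitions, and this is not the change made in r14877. A guard like
this only turns the crash into an error return; it does not explain why
current_base becomes NULL in the first place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the event backend descriptor (not OPAL's type). */
struct demo_eventop {
    const char *name;
};

/* Hypothetical stand-in for the event base (not OPAL's type). */
struct demo_event_base {
    const struct demo_eventop *evsel;
};

/*
 * Sketch of an event loop entry point that refuses a NULL base instead of
 * dereferencing it. Returns 0 on success, -1 if no usable base was passed.
 */
int demo_event_base_loop(struct demo_event_base *base, int flags)
{
    (void)flags; /* unused in this sketch */

    if (base == NULL || base->evsel == NULL) {
        fprintf(stderr, "demo_event_base_loop: no event base available\n");
        return -1;
    }

    const struct demo_eventop *evsel = base->evsel;
    printf("running loop with backend %s\n", evsel->name);
    return 0;
}

/* Exercises both paths: the NULL base from the report, and a valid base. */
void demo_self_check(void)
{
    struct demo_eventop op = { "demo-backend" };
    struct demo_event_base base = { &op };

    assert(demo_event_base_loop(NULL, 5) == -1);
    assert(demo_event_base_loop(&base, 5) == 0);
}
```

Calling demo_self_check() from a small main() runs both cases. Without the
NULL check, the first call would fault on base->evsel in the same way as
frame #0 of the trace.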