Open MPI Development Mailing List Archives

From: Greg Watson (gwatson_at_[hidden])
Date: 2005-12-21 16:11:56


I just tried 1.0.2a1r8580 but the problem is still there...

Greg

On Dec 20, 2005, at 5:02 PM, Jeff Squyres wrote:

> I think we found the problem and committed a fix this afternoon to
> both the trunk and v1.0 branch. Anything after r8564 should have the
> fix.
>
> Greg -- could you try again?
>
>
> On Dec 19, 2005, at 4:59 PM, Paul H. Hargrove wrote:
>
>> Jeff,
>>
>> I have an FC4 x86 w/ OSCAR bits on it :-). Let me know if you
>> want access.
>>
>> -Paul
>>
>> Jeff Squyres wrote:
>>> Yoinks. Let me try to scrounge up an FC4 box to reproduce this on.
>>> If it really is an -O problem, this segv may just be the symptom, not
>>> the cause (seems likely, because mca_pls_rsh_component is a
>>> statically-defined variable -- accessing a member on it should
>>> definitely not cause a segv). :-(
>>>
>>>
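For context on the reasoning above: a statically-defined component struct
has its storage allocated in the shared object's data segment, so its
address is valid for the whole life of the process and a plain member read
cannot fault by itself. A minimal sketch of the pattern, with illustrative
names rather than Open MPI's actual declarations:

    /* File-scope ("statically-defined") component struct: its storage
     * lives in the DSO's data segment, at an address that is mapped for
     * the life of the process. */
    struct example_component {
        int debug;
    };

    static struct example_component mca_example_component = { 0 };

    int example_launch(void)
    {
        /* This member read compiles to a load from a fixed offset off a
         * known base address; it cannot segfault on its own.  If the
         * debugger blames a crash on a line like this, the real cause is
         * usually earlier corruption or, as suspected in this thread, a
         * miscompiled (-O) build. */
        return mca_example_component.debug;
    }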
>>> On Dec 18, 2005, at 12:11 PM, Greg Watson wrote:
>>>
>>>> Sure seems like it:
>>>>
>>>> (gdb) p *mca_pls_rsh_component.argv@4
>>>> $12 = {0x90e0428 "ssh", 0x90e0438 "-x", 0x0, 0x11 <Address 0x11 out of bounds>}
>>>> (gdb) p mca_pls_rsh_component.argc
>>>> $13 = 2
>>>> (gdb) p local_exec_index
>>>> $14 = 3
>>>>
>>>>
>>>> Greg
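The numbers above pin the failure down: the artificial-array print (gdb's
ptr@length syntax) shows the vector holds just "ssh" and "-x" followed by
the NULL terminator (matching argc = 2), while local_exec_index is 3, so
any use of argv[local_exec_index] works on the garbage word past the
terminator (the out-of-bounds 0x11). A minimal sketch of that failure mode
and the bounds check that would catch it -- hypothetical code, not the
actual pls_rsh source:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* argv-style vector as shown in gdb: two real entries plus a
         * NULL terminator, hence argc == 2 */
        char *argv[] = { "ssh", "-x", NULL };
        int argc = 2;
        int local_exec_index = 3;   /* stale index from a longer vector */

        /* Without this check, argv[3] reads whatever happens to sit
         * past the terminator -- in the session above, the bogus
         * pointer 0x11. */
        if (local_exec_index < 0 || local_exec_index >= argc) {
            fprintf(stderr, "exec index %d out of range (argc = %d)\n",
                    local_exec_index, argc);
            return EXIT_FAILURE;
        }
        printf("exec entry: %s\n", argv[local_exec_index]);
        return EXIT_SUCCESS;
    }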
>>>>
>>>> On Dec 18, 2005, at 4:56 AM, Rainer Keller wrote:
>>>>
>>>>> Hello Greg,
>>>>> I don't know whether it's segfaulting at that particular line, but
>>>>> could you please print the argv? I suspect the local_exec_index
>>>>> into the argv might be wrong.
>>>>>
>>>>> Thanks,
>>>>> Rainer
>>>>>
>>>>> On Saturday 17 December 2005 19:16, Greg Watson wrote:
>>>>>> Here's the stacktrace:
>>>>>>
>>>>>> #0  0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714
>>>>>> 714         if (mca_pls_rsh_component.debug) {
>>>>>> (gdb) where
>>>>>> #0  0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714
>>>>>> #1  0x00a29642 in orte_rmgr_urm_spawn ()
>>>>>>     from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
>>>>>> #2  0x0804a0d4 in orterun (argc=4, argv=0xbff88594) at orterun.c:373
>>>>>> #3  0x08049b16 in main (argc=4, argv=0xbff88594) at main.c:13
>>>>>>
>>>>>> And the contents of mca_pls_rsh_component:
>>>>>>
>>>>>> (gdb) p mca_pls_rsh_component
>>>>>> $2 = {super = {pls_version = {mca_major_version = 1,
>>>>>>       mca_minor_version = 0, mca_release_version = 0,
>>>>>>       mca_type_name = "pls", '\0' <repeats 28 times>,
>>>>>>       mca_type_major_version = 1, mca_type_minor_version = 0,
>>>>>>       mca_type_release_version = 0,
>>>>>>       mca_component_name = "rsh", '\0' <repeats 60 times>,
>>>>>>       mca_component_major_version = 1,
>>>>>>       mca_component_minor_version = 0,
>>>>>>       mca_component_release_version = 1,
>>>>>>       mca_open_component = 0xae0a80 <orte_pls_rsh_component_open>,
>>>>>>       mca_close_component = 0xae09a0 <orte_pls_rsh_component_close>},
>>>>>>     pls_data = {mca_is_checkpointable = true},
>>>>>>     pls_init = 0xae093c <orte_pls_rsh_component_init>},
>>>>>>   debug = false, reap = true, assume_same_shell = true,
>>>>>>   delay = 1, priority = 10,
>>>>>>   argv = 0x90e0418, argc = 2, orted = 0x90de438 "orted",
>>>>>>   path = 0x90e0960 "/usr/bin/ssh", num_children = 0,
>>>>>>   num_concurrent = 128,
>>>>>>   lock = {super = {obj_class = 0x804ec38, obj_reference_count = 1},
>>>>>>     m_lock_pthread = {__data = {__lock = 0, __count = 0, __owner = 0,
>>>>>>         __kind = 0, __nusers = 0, __spins = 0},
>>>>>>       __size = '\0' <repeats 23 times>, __align = 0},
>>>>>>     m_lock_atomic = {u = {lock = 0, sparc_lock = 0 '\0',
>>>>>>         padding = "\000\000\000"}}},
>>>>>>   cond = {super = {obj_class = 0x804ec18, obj_reference_count = 1},
>>>>>>     c_waiting = 0, c_signaled = 0,
>>>>>>     c_cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
>>>>>>         __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
>>>>>>         __nwaiters = 0, __broadcast_seq = 0},
>>>>>>       __size = '\0' <repeats 47 times>, __align = 0}}}
>>>>>>
>>>>>> I can't see why it is segfaulting at this particular line.
>>>>>>
>>>>>> Greg
>>>>>>
>>>>>> On Dec 16, 2005, at 5:55 PM, Jeff Squyres wrote:
>>>>>>> On Dec 16, 2005, at 10:47 AM, Greg Watson wrote:
>>>>>>>> I finally worked out why I couldn't reproduce the problem.
>>>>>>>> You're not
>>>>>>>> going to like it though.
>>>>>>> You're right -- this kind of buglet is among the most un-fun. :-(
>>>>>>>
>>>>>>>> Here's the stack trace from the core file:
>>>>>>>>
>>>>>>>> #0  0x00e93fe8 in orte_pls_rsh_launch ()
>>>>>>>>     from /usr/local/ompi/lib/openmpi/mca_pls_rsh.so
>>>>>>>> #1  0x0023c642 in orte_rmgr_urm_spawn ()
>>>>>>>>     from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
>>>>>>>> #2  0x0804a0d4 in orterun (argc=5, argv=0xbfab2e84) at orterun.c:373
>>>>>>>> #3  0x08049b16 in main (argc=5, argv=0xbfab2e84) at main.c:13
>>>>>>> Can you recompile this one file with -g? Specifically, cd into
>>>>>>> the orte/mca/pls/rsh dir and "make clean". Then "make". Then
>>>>>>> cut-n-paste the compile line for that one file to a shell
>>>>>>> prompt, and put in a -g.
>>>>>>>
>>>>>>> Then either re-install that component (it looks like you're
>>>>>>> doing a dynamic build with separate components, so you can do
>>>>>>> "make install" right from the rsh dir) or re-link liborte,
>>>>>>> re-install that, and re-run. The corefile might give something
>>>>>>> a little more meaningful in this case...?
>>>>>>>
>>>>> --
>>>>> ---------------------------------------------------------------------
>>>>> Dipl.-Inf. Rainer Keller        email: keller_at_[hidden]
>>>>> High Performance Computing      Tel: ++49 (0)711-685 5858
>>>>> Center Stuttgart (HLRS)         Fax: ++49 (0)711-678 7626
>>>>> POSTAL: Nobelstrasse 19         http://www.hlrs.de/people/keller
>>>>> ACTUAL: Allmandring 30, R. O.030
>>>>> 70550 Stuttgart
>>>
>>>
>>
>>
>> --
>> Paul H. Hargrove                      PHHargrove_at_[hidden]
>> Future Technologies Group
>> HPC Research Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
>
>