Open MPI Development Mailing List Archives

From: Paul H. Hargrove (PHHargrove_at_[hidden])
Date: 2005-12-19 18:59:09


Jeff,

   I have an FC4 x86 box w/ OSCAR bits on it :-). Let me know if you want
access.

-Paul

Jeff Squyres wrote:
> Yoinks. Let me try to scrounge up an FC4 box to reproduce this on.
> If it really is an -O problem, this segv may just be the symptom, not
> the cause (seems likely, because mca_pls_rsh_component is a
> statically-defined variable -- accessing a member on it should
> definitely not cause a segv). :-(
>
>
> On Dec 18, 2005, at 12:11 PM, Greg Watson wrote:
>
>> Sure seems like it:
>>
>> (gdb) p *mca_pls_rsh_component.argv@4
>> $12 = {0x90e0428 "ssh", 0x90e0438 "-x", 0x0, 0x11 <Address 0x11 out of bounds>}
>> (gdb) p mca_pls_rsh_component.argc
>> $13 = 2
>> (gdb) p local_exec_index
>> $14 = 3
>>
>>
>> Greg
>>
>> On Dec 18, 2005, at 4:56 AM, Rainer Keller wrote:
>>
>>> Hello Greg,
>>> I don't know whether it's segfaulting at that particular line, but
>>> could you please print the argv? My guess is that the
>>> local_exec_index into the argv might be wrong.
>>>
>>> Thanks,
>>> Rainer
>>>
>>> On Saturday 17 December 2005 19:16, Greg Watson wrote:
>>>> Here's the stacktrace:
>>>>
>>>> #0 0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714
>>>> 714 if (mca_pls_rsh_component.debug) {
>>>> (gdb) where
>>>> #0 0x00ae1fe8 in orte_pls_rsh_launch (jobid=1) at pls_rsh_module.c:714
>>>> #1 0x00a29642 in orte_rmgr_urm_spawn ()
>>>>    from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
>>>> #2 0x0804a0d4 in orterun (argc=4, argv=0xbff88594) at orterun.c:373
>>>> #3 0x08049b16 in main (argc=4, argv=0xbff88594) at main.c:13
>>>>
>>>> And the contents of mca_pls_rsh_component:
>>>>
>>>> (gdb) p mca_pls_rsh_component
>>>> $2 = {super = {pls_version = {mca_major_version = 1, mca_minor_version = 0,
>>>>       mca_release_version = 0, mca_type_name = "pls", '\0' <repeats 28 times>,
>>>>       mca_type_major_version = 1, mca_type_minor_version = 0,
>>>>       mca_type_release_version = 0,
>>>>       mca_component_name = "rsh", '\0' <repeats 60 times>,
>>>>       mca_component_major_version = 1, mca_component_minor_version = 0,
>>>>       mca_component_release_version = 1,
>>>>       mca_open_component = 0xae0a80 <orte_pls_rsh_component_open>,
>>>>       mca_close_component = 0xae09a0 <orte_pls_rsh_component_close>},
>>>>     pls_data = {mca_is_checkpointable = true},
>>>>     pls_init = 0xae093c <orte_pls_rsh_component_init>}, debug = false,
>>>>   reap = true, assume_same_shell = true, delay = 1, priority = 10,
>>>>   argv = 0x90e0418, argc = 2, orted = 0x90de438 "orted",
>>>>   path = 0x90e0960 "/usr/bin/ssh", num_children = 0, num_concurrent = 128,
>>>>   lock = {super = {obj_class = 0x804ec38, obj_reference_count = 1},
>>>>     m_lock_pthread = {__data = {__lock = 0, __count = 0, __owner = 0,
>>>>         __kind = 0, __nusers = 0, __spins = 0},
>>>>       __size = '\0' <repeats 23 times>, __align = 0},
>>>>     m_lock_atomic = {u = {lock = 0, sparc_lock = 0 '\0', padding = "\000\000\000"}}},
>>>>   cond = {super = {obj_class = 0x804ec18, obj_reference_count = 1},
>>>>     c_waiting = 0, c_signaled = 0,
>>>>     c_cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
>>>>         __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
>>>>         __nwaiters = 0, __broadcast_seq = 0},
>>>>       __size = '\0' <repeats 47 times>, __align = 0}}}
>>>>
>>>> I can't see why it is segfaulting at this particular line.
>>>>
>>>> Greg
>>>>
>>>> On Dec 16, 2005, at 5:55 PM, Jeff Squyres wrote:
>>>>> On Dec 16, 2005, at 10:47 AM, Greg Watson wrote:
>>>>>> I finally worked out why I couldn't reproduce the problem. You're
>>>>>> not going to like it though.
>>>>> You're right -- this kind of buglet is among the most un-fun. :-(
>>>>>
>>>>>> Here's the stack trace from the core file:
>>>>>>
>>>>>> #0 0x00e93fe8 in orte_pls_rsh_launch ()
>>>>>> from /usr/local/ompi/lib/openmpi/mca_pls_rsh.so
>>>>>> #1 0x0023c642 in orte_rmgr_urm_spawn ()
>>>>>> from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
>>>>>> #2 0x0804a0d4 in orterun (argc=5, argv=0xbfab2e84) at orterun.c:373
>>>>>> #3 0x08049b16 in main (argc=5, argv=0xbfab2e84) at main.c:13
>>>>> Can you recompile this one file with -g? Specifically, cd into the
>>>>> orte/mca/pls/rsh dir and "make clean". Then "make". Then cut-n-paste
>>>>> the compile line for that one file to a shell prompt, and put in a -g.
>>>>>
>>>>> Then either re-install that component (it looks like you're doing a
>>>>> dynamic build with separate components, so you can do "make install"
>>>>> right from the rsh dir) or re-link liborte and re-install that and
>>>>> re-run. The corefile might give something a little more meaningful in
>>>>> this case...?
>>>>>
>>>>> --
>>>>> {+} Jeff Squyres
>>>>> {+} The Open MPI Project
>>>>> {+} http://www.open-mpi.org/
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> --
>>> ---------------------------------------------------------------------
>>> Dipl.-Inf. Rainer Keller email: keller_at_[hidden]
>>> High Performance Computing Tel: ++49 (0)711-685 5858
>>> Center Stuttgart (HLRS) Fax: ++49 (0)711-678 7626
>>> POSTAL:Nobelstrasse 19 http://www.hlrs.de/people/keller
>>> ACTUAL:Allmandring 30, R. O.030
>>> 70550 Stuttgart
>
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
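
The gdb output quoted above (argc = 2, local_exec_index = 3, and a fourth
argv slot holding 0x11) is consistent with an index that points past the
end of the argument vector. As a rough illustration only, and not the
actual pls_rsh_module.c code, a bounds check of that kind might look like
the C sketch below. The helper names argv_count, argv_index_in_bounds, and
argv_get_checked are hypothetical and do not come from Open MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the strings in a NULL-terminated argv. */
static int argv_count(char **argv)
{
    int n = 0;

    assert(argv != NULL);
    while (argv[n] != NULL) {
        n++;
    }
    return n;
}

/* Hypothetical helper: valid argument slots are 0 .. argc-1.
 * argv[argc] is the NULL terminator; anything past it is not ours. */
static int argv_index_in_bounds(int argc, int index)
{
    return index >= 0 && index < argc;
}

/* Hypothetical helper: return argv[index], or NULL when the index is
 * outside the vector, instead of reading whatever lies beyond it. */
static char *argv_get_checked(char **argv, int argc, int index)
{
    if (argv == NULL || !argv_index_in_bounds(argc, index)) {
        return NULL;
    }
    return argv[index];
}
```

With argv = {"ssh", "-x", NULL} and argc = 2 as in the dump,
argv_get_checked(argv, 2, 3) returns NULL rather than touching the stray
fourth slot. Whether the real crash comes from this index or from an
optimizer issue is still open in the thread.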