Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-12-19 12:05:05


Yoinks. Let me try to scrounge up an FC4 box to reproduce this on.
If it really is an -O problem, this segv may just be the symptom,
not the cause (seems likely, because mca_pls_rsh_component is a
statically-defined variable -- accessing a member on it should
definitely not cause a segv). :-(
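
For reference, a minimal sketch of what "statically defined" means here (the declarations below are simplified stand-ins, not the actual Open MPI source): the component struct lives in the library's data segment, so reading a field off it is a load from a fixed address and cannot itself fault.

/* Simplified stand-in for mca_pls_rsh_component -- illustrative only.
 * Because the variable is statically allocated, the member access in
 * the if-test compiles to a load from a fixed data-segment address;
 * no pointer is dereferenced, so this access alone cannot segfault. */
struct pls_rsh_component {
    int debug;
    int argc;
    char **argv;
};

static struct pls_rsh_component component_sketch = { 0, 0, 0 };

int launch_sketch(void)
{
    if (component_sketch.debug) {   /* same pattern as line 714 */
        /* ... */
    }
    return 0;
}

When an optimized (-O) build reports a crash on a line like this, the line attribution is often skewed and the real fault is nearby; Greg's gdb output in the quoted thread below points at the argv, and a second sketch after the quoted thread illustrates the off-by-one it suggests.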

On Dec 18, 2005, at 12:11 PM, Greg Watson wrote:

> Sure seems like it:
>
> (gdb) p *mca_pls_rsh_component.argv@4
> $12 = {0x90e0428 "ssh", 0x90e0438 "-x", 0x0,
>   0x11 <Address 0x11 out of bounds>}
> (gdb) p mca_pls_rsh_component.argc
> $13 = 2
> (gdb) p local_exec_index
> $14 = 3
>
>
> Greg
>
> On Dec 18, 2005, at 4:56 AM, Rainer Keller wrote:
>
>> Hello Greg,
>> I don't know whether it's segfaulting at that particular line, but
>> could you please print the argv? I suspect the local_exec_index
>> into the argv might be wrong.
>>
>> Thanks,
>> Rainer
>>
>> On Saturday 17 December 2005 19:16, Greg Watson wrote:
>>> Here's the stacktrace:
>>>
>>> #0  0x00ae1fe8 in orte_pls_rsh_launch (jobid=1)
>>>     at pls_rsh_module.c:714
>>> 714         if (mca_pls_rsh_component.debug) {
>>> (gdb) where
>>> #0  0x00ae1fe8 in orte_pls_rsh_launch (jobid=1)
>>>     at pls_rsh_module.c:714
>>> #1  0x00a29642 in orte_rmgr_urm_spawn ()
>>>     from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
>>> #2  0x0804a0d4 in orterun (argc=4, argv=0xbff88594) at orterun.c:373
>>> #3  0x08049b16 in main (argc=4, argv=0xbff88594) at main.c:13
>>>
>>> And the contents of mca_pls_rsh_component:
>>>
>>> (gdb) p mca_pls_rsh_component
>>> $2 = {super = {pls_version = {mca_major_version = 1,
>>>       mca_minor_version = 0, mca_release_version = 0,
>>>       mca_type_name = "pls", '\0' <repeats 28 times>,
>>>       mca_type_major_version = 1, mca_type_minor_version = 0,
>>>       mca_type_release_version = 0,
>>>       mca_component_name = "rsh", '\0' <repeats 60 times>,
>>>       mca_component_major_version = 1,
>>>       mca_component_minor_version = 0,
>>>       mca_component_release_version = 1,
>>>       mca_open_component = 0xae0a80 <orte_pls_rsh_component_open>,
>>>       mca_close_component = 0xae09a0 <orte_pls_rsh_component_close>},
>>>     pls_data = {mca_is_checkpointable = true},
>>>     pls_init = 0xae093c <orte_pls_rsh_component_init>},
>>>   debug = false, reap = true, assume_same_shell = true, delay = 1,
>>>   priority = 10, argv = 0x90e0418, argc = 2,
>>>   orted = 0x90de438 "orted", path = 0x90e0960 "/usr/bin/ssh",
>>>   num_children = 0, num_concurrent = 128,
>>>   lock = {super = {obj_class = 0x804ec38, obj_reference_count = 1},
>>>     m_lock_pthread = {__data = {__lock = 0, __count = 0, __owner = 0,
>>>         __kind = 0, __nusers = 0, __spins = 0},
>>>       __size = '\0' <repeats 23 times>, __align = 0},
>>>     m_lock_atomic = {u = {lock = 0, sparc_lock = 0 '\0',
>>>         padding = "\000\000\000"}}},
>>>   cond = {super = {obj_class = 0x804ec18, obj_reference_count = 1},
>>>     c_waiting = 0, c_signaled = 0,
>>>     c_cond = {__data = {__lock = 0, __futex = 0, __total_seq = 0,
>>>         __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
>>>         __nwaiters = 0, __broadcast_seq = 0},
>>>       __size = '\0' <repeats 47 times>, __align = 0}}}
>>>
>>> I can't see why it is segfaulting at this particular line.
>>>
>>> Greg
>>>
>>> On Dec 16, 2005, at 5:55 PM, Jeff Squyres wrote:
>>>> On Dec 16, 2005, at 10:47 AM, Greg Watson wrote:
>>>>> I finally worked out why I couldn't reproduce the problem.
>>>>> You're not going to like it though.
>>>>
>>>> You're right -- this kind of buglet is among the most un-fun. :-(
>>>>
>>>>> Here's the stack trace from the core file:
>>>>>
>>>>> #0  0x00e93fe8 in orte_pls_rsh_launch ()
>>>>>     from /usr/local/ompi/lib/openmpi/mca_pls_rsh.so
>>>>> #1  0x0023c642 in orte_rmgr_urm_spawn ()
>>>>>     from /usr/local/ompi/lib/openmpi/mca_rmgr_urm.so
>>>>> #2  0x0804a0d4 in orterun (argc=5, argv=0xbfab2e84) at orterun.c:373
>>>>> #3  0x08049b16 in main (argc=5, argv=0xbfab2e84) at main.c:13
>>>>
>>>> Can you recompile this one file with -g? Specifically, cd into the
>>>> orte/mca/pls/rsh dir and "make clean". Then "make". Then cut-n-
>>>> paste the compile line for that one file to a shell prompt, and put
>>>> in a -g.
>>>>
>>>> Then either re-install that component (it looks like you're doing
>>>> a dynamic build with separate components, so you can do "make
>>>> install" right from the rsh dir), or re-link liborte, re-install
>>>> that, and re-run. The corefile might give something a little more
>>>> meaningful in this case...?
>>>>
>>>> --
>>>> {+} Jeff Squyres
>>>> {+} The Open MPI Project
>>>> {+} http://www.open-mpi.org/
>>>>
>>>
>>
>> --
>> ---------------------------------------------------------------------
>> Dipl.-Inf. Rainer Keller       email: keller_at_[hidden]
>> High Performance Computing     Tel:   ++49 (0)711-685 5858
>> Center Stuttgart (HLRS)        Fax:   ++49 (0)711-678 7626
>> POSTAL: Nobelstrasse 19        http://www.hlrs.de/people/keller
>> ACTUAL: Allmandring 30, R. O.030
>> 70550 Stuttgart
>
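
Putting Greg's numbers together supports Rainer's guess: the argv holds argc == 2 real entries ("ssh", "-x") plus a NULL terminator, while local_exec_index == 3 points one slot past the terminator (the 0x11 garbage). A minimal, self-contained sketch of that failure mode, in plain C with illustrative code rather than the real launcher:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Build an argv the way the gdb output shows it: two real
     * entries plus a NULL terminator.  Slot 3 is never allocated;
     * in Greg's core file it happened to hold 0x11. */
    int argc = 2;
    char **argv = calloc(argc + 1, sizeof(char *));
    argv[0] = strdup("ssh");
    argv[1] = strdup("-x");
    argv[2] = NULL;               /* terminator lives at argv[argc] */

    int local_exec_index = 3;     /* one past the terminator: the bug */

    /* Reading argv[3] is already out of bounds, and treating the
     * garbage found there as a string is what actually segfaults. */
    printf("%s\n", argv[local_exec_index]);   /* undefined behavior */
    return 0;
}

If local_exec_index is meant to index the argv slot that receives the executable name, it presumably should not land past the terminator at argv[argc]; an index of 3 against an argc of 2 is consistent with Rainer's suspicion that the index, not the struct access, is what is wrong.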

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/