
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Openmpi 1.6.5 is freezing under GNU/Linux ia64
From: Sylvestre Ledru (sylvestre_at_[hidden])
Date: 2013-11-16 05:33:34


On 15/11/2013 17:50, Ralph Castain wrote:
> Hmm...well, that will make debug a tad more difficult. I've attached a
> patch that *should* stop the segfault. Given that behavior, though, it
> looks like the system isn't finding either rsh or ssh on your machine.
> Might be the root cause of the problem.
With your patch:
$ ./mpirun -mca plm_base_verbose 5 -mca ras_base_verbose 5
-mcarmaps_base_verbose 5 -mca ess_base_verbose 5 -c 4 foo
[merulo:08821] mca:base:select:( plm) Querying component [rsh]
[merulo:08821] [[INVALID],INVALID] plm:base:rsh_lookup on agent ssh :
rsh path NULL
[merulo:08821] *** Process received signal ***
[merulo:08821] Signal: Segmentation fault (11)
[merulo:08821] Signal code: Invalid permissions (2)
[merulo:08821] Failing at address: (nil)
[merulo:08821] [ 0]
linux-gate.so.1(__kernel_sigtramp+0x7fffffffff886860) [0xa000000000040800]
[merulo:08821] [ 1]
/home/sylvestre/bogus2/lib/openmpi/mca_plm_rsh.so(orte_plm_rsh_component_query+0xae3b0)
[0x2000000000867f30]
[merulo:08821] [ 2]
/home/sylvestre/bogus2/lib/libopen-rte.so.4(mca_base_select-0x5dc110)
[0x20000000001ddea0]
[merulo:08821] [ 3]
/home/sylvestre/bogus2/lib/libopen-rte.so.4(orte_plm_base_select-0x680cd0)
[0x20000000001392f0]
[merulo:08821] [ 4]
/home/sylvestre/bogus2/lib/openmpi/mca_ess_hnp.so(+0x56f0)
[0x20000000008316f0]
[merulo:08821] [ 5]
/home/sylvestre/bogus2/lib/libopen-rte.so.4(orte_init-0x72bf10)
[0x200000000008e0c0]
[merulo:08821] [ 6] ./mpirun(orterun+0x1fffffffff84cc80)
[0x4000000000006c60]
[merulo:08821] [ 7] ./mpirun(main+0x1fffffffff84b880) [0x40000000000045e0]
[merulo:08821] [ 8]
/lib/ia64-linux-gnu/libc.so.6.1(__libc_start_main-0x2fcd50)
[0x20000000004bd2a0]
[merulo:08821] [ 9] ./mpirun(_start+0x1fffffffff84a3c0) [0x40000000000043c0]
[merulo:08821] *** End of error message ***
Segmentation fault

bt:
Program received signal SIGSEGV, Segmentation fault.
0x2000000000867f30 in orte_plm_rsh_component_query (
    module=0x60000fffffffb0e8, priority=0x60000fffffffb0e0)
    at plm_rsh_component.c:205
205 OPAL_OUTPUT_VERBOSE((1, orte_plm_globals.output,
(gdb) bt
#0 0x2000000000867f30 in orte_plm_rsh_component_query (
    module=0x60000fffffffb0e8, priority=0x60000fffffffb0e0)
    at plm_rsh_component.c:205
#1 0x20000000001ddea0 in mca_base_select (
    type_name=0x200000000026e708 "plm", output_id=8,
    components_available=0x20000000002c5f08 <orte_plm_base>,
    best_module=0x60000fffffffb0f0, best_component=0x60000fffffffb0f8)
    at mca_base_components_select.c:76
#2 0x20000000001392f0 in orte_plm_base_select () at
base/plm_base_select.c:46
#3 0x20000000008316f0 in rte_init () at ess_hnp_module.c:169
#4 0x200000000008e0c0 in orte_init (pargc=0x60000fffffffb370,
    pargv=0x60000fffffffb378, flags=4) at runtime/orte_init.c:127
#5 0x4000000000006c60 in orterun (argc=15, argv=0x60000fffffffb628)
    at orterun.c:693
#6 0x40000000000045e0 in main (argc=15, argv=0x60000fffffffb628) at
main.c:13
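
Since the verbose output reports "rsh path NULL" and the quoted suggestion is that neither rsh nor ssh is being found, a quick check of PATH on the affected machine may help confirm or rule that out. The following is only a minimal sketch: find_agent is a hypothetical helper name, not part of Open MPI, and it only looks at the PATH of the current shell, which may differ from what mpirun sees.

```shell
#!/bin/sh
# find_agent is a hypothetical helper, not part of Open MPI.
# It prints the path of the first candidate found on PATH and
# returns non-zero when none of the candidates can be found.
find_agent() {
    for candidate in "$@"; do
        if path=$(command -v "$candidate" 2>/dev/null); then
            printf '%s\n' "$path"
            return 0
        fi
    done
    return 1
}

# Check the two launch agents named in the thread.
if agent=$(find_agent ssh rsh); then
    echo "launch agent found: $agent"
else
    echo "neither ssh nor rsh is on PATH" >&2
fi
```

If this reports that neither agent is on PATH, that would be consistent with the NULL path in the log above, though it would not by itself explain why the component query crashes instead of declining to be selected.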

S