Open MPI User's Mailing List Archives


From: Grobe, Gary L. (JSC-EV)[ESCG] (gary.l.grobe_at_[hidden])
Date: 2007-01-08 15:46:09


> >> PS: Is there any way you can attach to the processes with gdb ? I
> >> would like to see the backtrace as showed by gdb in order
> to be able
> >> to figure out what's wrong there.
> >
> > When I can get more detailed dbg, I'll send. Though I'm not
> clear on
> > what executable is being searched for below.
> >
> > $ mpirun -dbg=gdb --prefix /usr/local/openmpi-1.2b3r13030 -x
> > LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5
> --mca pml
> > cm --mca mtl mx ./cpi
>
> FWIW, note that "-dbg" is not a recognized Open MPI mpirun
> command line switch -- after all the debugging information,
> Open MPI finally gets to telling you:
>

Sorry, wrong MPI, ok ... FWIW, here's a reproducible crash with just the -d
option. The problem I'm trying to solve right now is how to debug the 2nd
process on the 2nd node, since that's where the crash always happens.
Running one process past the 1st node works fine (5 procs w/ 4 per node),
but as soon as a second process starts on the 2nd node, or anything more
than that, the crash occurs.

$ mpirun -d --prefix /usr/local/openmpi-1.2b3r13030 -x
LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 6 --mca pml cm
--mca mtl mx ./cpi > dbg.out 2>&1

[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] connect_uni: connection not allowed
[juggernaut:15087] [0,0,0] setting up session dir with
[juggernaut:15087] universe default-universe-15087
[juggernaut:15087] user ggrobe
[juggernaut:15087] host juggernaut
[juggernaut:15087] jobid 0
[juggernaut:15087] procid 0
[juggernaut:15087] procdir:
/tmp/openmpi-sessions-ggrobe_at_juggernaut_0/default-universe-15087/0/0
[juggernaut:15087] jobdir:
/tmp/openmpi-sessions-ggrobe_at_juggernaut_0/default-universe-15087/0
[juggernaut:15087] unidir:
/tmp/openmpi-sessions-ggrobe_at_juggernaut_0/default-universe-15087
[juggernaut:15087] top: openmpi-sessions-ggrobe_at_juggernaut_0
[juggernaut:15087] tmp: /tmp
[juggernaut:15087] [0,0,0] contact_file
/tmp/openmpi-sessions-ggrobe_at_juggernaut_0/default-universe-15087/universe-setup.txt
[juggernaut:15087] [0,0,0] wrote setup file
[juggernaut:15087] pls:rsh: local csh: 0, local sh: 1
[juggernaut:15087] pls:rsh: assuming same remote shell as local shell
[juggernaut:15087] pls:rsh: remote csh: 0, remote sh: 1
[juggernaut:15087] pls:rsh: final template argv:
[juggernaut:15087] pls:rsh: /usr/bin/ssh <template> orted --debug
--bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename
<template> --universe ggrobe_at_juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-1
[juggernaut:15087] pls:rsh: node-1 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-1
PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted
--debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0
--nodename node-1 --universe ggrobe_at_juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[juggernaut:15087] pls:rsh: launching on node node-2
[juggernaut:15087] pls:rsh: node-2 is a REMOTE node
[juggernaut:15087] pls:rsh: executing: /usr/bin/ssh node-2
PATH=/usr/local/openmpi-1.2b3r13030/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/usr/local/openmpi-1.2b3r13030/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /usr/local/openmpi-1.2b3r13030/bin/orted
--debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
--nodename node-2 --universe ggrobe_at_juggernaut:default-universe-15087
--nsreplica "0.0.0;tcp://192.168.2.10:52099" --gprreplica
"0.0.0;tcp://192.168.2.10:52099"
[node-2:11499] [0,0,2] setting up session dir with
[node-2:11499] universe default-universe-15087
[node-2:11499] user ggrobe
[node-2:11499] host node-2
[node-2:11499] jobid 0
[node-2:11499] procid 2
[node-1:10307] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/0/1
[node-1:10307] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/0
[node-1:10307] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087
[node-1:10307] top: openmpi-sessions-ggrobe_at_node-1_0
[node-2:11499] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087/0/2
[node-2:11499] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087/0
[node-2:11499] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087
[node-2:11499] top: openmpi-sessions-ggrobe_at_node-2_0
[node-2:11499] tmp: /tmp
[node-1:10307] tmp: /tmp
[node-2:11500] [0,1,4] setting up session dir with
[node-2:11500] universe default-universe-15087
[node-2:11500] user ggrobe
[node-2:11500] host node-2
[node-2:11500] jobid 1
[node-2:11500] procid 4
[node-2:11501] [0,1,5] setting up session dir with
[node-2:11501] universe default-universe-15087
[node-2:11501] user ggrobe
[node-2:11501] host node-2
[node-2:11501] jobid 1
[node-2:11501] procid 5
[node-1:10308] [0,1,0] setting up session dir with
[node-1:10308] universe default-universe-15087
[node-1:10308] user ggrobe
[node-1:10308] host node-1
[node-1:10308] jobid 1
[node-1:10308] procid 0
[node-2:11500] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087/1/4
[node-2:11500] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087/1
[node-2:11500] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087
[node-2:11500] top: openmpi-sessions-ggrobe_at_node-2_0
[node-2:11500] tmp: /tmp
[node-2:11501] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087/1/5
[node-2:11501] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087/1
[node-2:11501] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-2_0/default-universe-15087
[node-2:11501] top: openmpi-sessions-ggrobe_at_node-2_0
[node-2:11501] tmp: /tmp
[node-1:10308] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1/0
[node-1:10308] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1
[node-1:10308] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087
[node-1:10308] top: openmpi-sessions-ggrobe_at_node-1_0
[node-1:10308] tmp: /tmp
[node-1:10311] [0,1,3] setting up session dir with
[node-1:10311] universe default-universe-15087
[node-1:10311] user ggrobe
[node-1:10311] host node-1
[node-1:10311] jobid 1
[node-1:10311] procid 3
[node-1:10310] [0,1,2] setting up session dir with
[node-1:10310] universe default-universe-15087
[node-1:10310] user ggrobe
[node-1:10310] host node-1
[node-1:10310] jobid 1
[node-1:10310] procid 2
[node-1:10311] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1/3
[node-1:10311] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1
[node-1:10311] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087
[node-1:10311] top: openmpi-sessions-ggrobe_at_node-1_0
[node-1:10311] tmp: /tmp
[node-1:10310] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1/2
[node-1:10310] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1
[node-1:10310] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087
[node-1:10310] top: openmpi-sessions-ggrobe_at_node-1_0
[node-1:10310] tmp: /tmp
[node-1:10309] [0,1,1] setting up session dir with
[node-1:10309] universe default-universe-15087
[node-1:10309] user ggrobe
[node-1:10309] host node-1
[node-1:10309] jobid 1
[node-1:10309] procid 1
[node-1:10309] procdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1/1
[node-1:10309] jobdir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087/1
[node-1:10309] unidir:
/tmp/openmpi-sessions-ggrobe_at_node-1_0/default-universe-15087
[node-1:10309] top: openmpi-sessions-ggrobe_at_node-1_0
[node-1:10309] tmp: /tmp
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b8b99905d3f]
[1] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0 [0x2b8b99904891]
[2] func:/lib/libpthread.so.0 [0x2b8b99ec6d00]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b8b9cb072af]
[4] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_module_init+0x20) [0x2b8b9c9fcb50]
[5] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so [0x2b8b9c9fccb5]
[6] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mtl_base_select+0x6f) [0x2b8b9966165f]
[7] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_pml_cm.so [0x2b8b9c6d1aa6]
[8] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b8b99663ef3]
[9] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mpi_init+0x45e) [0x2b8b9962c7de]
[10] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(MPI_Init+0x83) [0x2b8b9964d903]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b8b99fed134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
[0] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b548c138d3f]
[1] func:/usr/local/openmpi-1.2b3r13030/lib/libopen-pal.so.0 [0x2b548c137891]
[2] func:/lib/libpthread.so.0 [0x2b548c6f9d00]
[3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b548f33a2af]
[4] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so(ompi_mtl_mx_module_init+0x20) [0x2b548f22fb50]
[5] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_mtl_mx.so [0x2b548f22fcb5]
[6] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mtl_base_select+0x6f) [0x2b548be9465f]
[7] func:/usr/local/openmpi-1.2b3r13030/lib/openmpi/mca_pml_cm.so [0x2b548ef04aa6]
[8] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b548be96ef3]
[9] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(ompi_mpi_init+0x45e) [0x2b548be5f7de]
[10] func:/usr/local/openmpi-1.2b3r13030/lib/libmpi.so.0(MPI_Init+0x83) [0x2b548be80903]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b548c820134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
[node-1:10307] sess_dir_finalize: proc session dir not empty - leaving
[juggernaut:15087] spawn: in job_state_callback(jobid = 1, state = 0x80)
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
signal 15.
[node-1:10307] sess_dir_finalize: job session dir not empty - leaving
[node-2:11499] sess_dir_finalize: job session dir not empty - leaving
5 additional processes aborted (not shown)
[juggernaut:15087] sess_dir_finalize: proc session dir not empty -
leaving
[node-1:10307] sess_dir_finalize: proc session dir not empty - leaving
[node-2:11499] sess_dir_finalize: proc session dir not empty - leaving