Open MPI Development Mailing List Archives

Subject: [OMPI devel] How to debug segv
From: Alex Margolin (alex.margolin_at_[hidden])
Date: 2012-04-25 07:05:27


Hi,

I'm getting a segfault (SIGSEGV) with my build of the trunk. I know that my BTL
module is responsible ("-mca btl self,tcp" works, "-mca btl self,mosix"
fails). Smaller/simpler test applications pass, but NPB doesn't. Can anyone
suggest how to proceed with debugging this? My attempts so far include some
debug printouts and GDB; the GDB output appears below. What can I do next?

I'd appreciate any input,
Alex

alex_at_singularity:~/huji/benchmarks/mpi/npb$ mpirun --debug-daemons -d -n 4 xterm -l -e gdb ft.S.4
[singularity:07557] procdir:
/tmp/openmpi-sessions-alex_at_singularity_0/44228/0/0
[singularity:07557] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/0
[singularity:07557] top: openmpi-sessions-alex_at_singularity_0
[singularity:07557] tmp: /tmp
[singularity:07557] [[44228,0],0] hostfile: checking hostfile
/home/alex/huji/ompi/etc/openmpi-default-hostfile for nodes
[singularity:07557] [[44228,0],0] hostfile: filtering nodes through
hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[singularity:07557] [[44228,0],0] orted_cmd: received add_local_procs
   MPIR_being_debugged = 0
   MPIR_debug_state = 1
   MPIR_partial_attach_ok = 1
   MPIR_i_am_starter = 0
   MPIR_forward_output = 0
   MPIR_proctable_size = 4
   MPIR_proctable:
     (i, host, exe, pid) = (0, singularity, /usr/bin/xterm, 7558)
     (i, host, exe, pid) = (1, singularity, /usr/bin/xterm, 7559)
     (i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
     (i, host, exe, pid) = (3, singularity, /usr/bin/xterm, 7561)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[singularity:07592] procdir:
/tmp/openmpi-sessions-alex_at_singularity_0/44228/1/3
[singularity:07592] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
[singularity:07592] top: openmpi-sessions-alex_at_singularity_0
[singularity:07592] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from
local proc [[44228,1],3]
[singularity:07592] [[44228,1],3] decode:nidmap decoding nodemap
[singularity:07592] [[44228,1],3] decode:nidmap decoding 1 nodes
[singularity:07592] [[44228,1],3] node[0].name singularity daemon 0
[singularity:07594] procdir:
/tmp/openmpi-sessions-alex_at_singularity_0/44228/1/1
[singularity:07594] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
[singularity:07594] top: openmpi-sessions-alex_at_singularity_0
[singularity:07594] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from
local proc [[44228,1],1]
[singularity:07594] [[44228,1],1] decode:nidmap decoding nodemap
[singularity:07594] [[44228,1],1] decode:nidmap decoding 1 nodes
[singularity:07594] [[44228,1],1] node[0].name singularity daemon 0
[singularity:07596] procdir:
/tmp/openmpi-sessions-alex_at_singularity_0/44228/1/0
[singularity:07596] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
[singularity:07596] top: openmpi-sessions-alex_at_singularity_0
[singularity:07596] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from
local proc [[44228,1],0]
[singularity:07596] [[44228,1],0] decode:nidmap decoding nodemap
[singularity:07596] [[44228,1],0] decode:nidmap decoding 1 nodes
[singularity:07596] [[44228,1],0] node[0].name singularity daemon 0
[singularity:07598] procdir:
/tmp/openmpi-sessions-alex_at_singularity_0/44228/1/2
[singularity:07598] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
[singularity:07598] top: openmpi-sessions-alex_at_singularity_0
[singularity:07598] tmp: /tmp
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
[singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from
local proc [[44228,1],2]
[singularity:07598] [[44228,1],2] decode:nidmap decoding nodemap
[singularity:07598] [[44228,1],2] decode:nidmap decoding 1 nodes
[singularity:07598] [[44228,1],2] node[0].name singularity daemon 0
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
[singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
[singularity:07557] [[44228,0],0] orted:comm:message_local_procs
delivering message to job [44228,1] tag 30
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
[singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
[singularity:07557] [[44228,0],0] orted:comm:message_local_procs
delivering message to job [44228,1] tag 30
[singularity:07557] [[44228,0],0]:errmgr_default_hnp.c(418) updating
exit status to 1
[singularity:07557] [[44228,0],0] orted:comm:process_commands()
Processing Command: ORTE_DAEMON_EXIT_CMD
[singularity:07557] [[44228,0],0] orted_cmd: received exit cmd
[singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
[singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
[singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
[singularity:07557] [[44228,0],0] orted_cmd: all routes and children
gone - exiting
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 7560 on
node singularity exiting improperly. There are three reasons this could
occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--------------------------------------------------------------------------
[singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
exiting with status 1
alex_at_singularity:~/huji/benchmarks/mpi/npb$ grep SIGSEGV *
Xterm.log.singularity.2012.04.24.20.38.03.6992:During startup program
terminated with signal SIGSEGV, Segmentation fault.
Xterm.log.singularity.2012.04.25.13.55.01.7560:During startup program
terminated with signal SIGSEGV, Segmentation fault.
alex_at_singularity:~/huji/benchmarks/mpi/npb$ cat Xterm.log.singularity.2012.04.25.13.55.01.7560
GNU gdb (Ubuntu/Linaro 7.3-0ubuntu2) 7.3-2011.08
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from
/home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4...(no
debugging symbols found)...done.
(gdb) r
Starting program:
/home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4
warning: Error disabling address space randomization: Function not
implemented
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) l
No symbol table is loaded. Use the "file" command.
(gdb) bt
No stack.
(gdb) alex_at_singularity:~/huji/benchmarks/mpi/npb$
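
To tie an xterm log back to an MPI rank, the MPIR_proctable lines printed by mpirun can be matched against the trailing number in the log file name. The following is only a minimal sketch, under two assumptions: that the last dot-separated field of an Xterm.log.* name is the PID of the xterm process, and that the proctable index i corresponds to the rank. The function names are hypothetical helpers, not part of Open MPI.

```python
import re

# Matches lines such as:
#   (i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
PROCTABLE_RE = re.compile(
    r"\(i, host, exe, pid\) = \((\d+), ([^,]+), ([^,]+), (\d+)\)"
)


def parse_proctable(mpirun_output):
    """Return {pid: index} built from the MPIR_proctable lines."""
    return {
        int(m.group(4)): int(m.group(1))
        for m in PROCTABLE_RE.finditer(mpirun_output)
    }


def pid_from_xterm_log(filename):
    """Assumption: the last dot-separated field of the name is the xterm PID."""
    return int(filename.rsplit(".", 1)[1])


def rank_for_log(mpirun_output, filename):
    """Return the proctable index for a log file, or None if the PID is unknown."""
    return parse_proctable(mpirun_output).get(pid_from_xterm_log(filename))


# Proctable lines copied from the mpirun output above.
SAMPLE_OUTPUT = """
     (i, host, exe, pid) = (0, singularity, /usr/bin/xterm, 7558)
     (i, host, exe, pid) = (1, singularity, /usr/bin/xterm, 7559)
     (i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
     (i, host, exe, pid) = (3, singularity, /usr/bin/xterm, 7561)
"""

if __name__ == "__main__":
    for name in (
        "Xterm.log.singularity.2012.04.24.20.38.03.6992",
        "Xterm.log.singularity.2012.04.25.13.55.01.7560",
    ):
        print(name, "->", rank_for_log(SAMPLE_OUTPUT, name))
```

For the run above, the log ending in 7560 maps to index 2, which agrees with the "process rank 2 with PID 7560" line in the mpirun message. The log ending in 6992 is not in this proctable, so it presumably comes from an earlier run.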