
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] How to debug segv
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-04-25 07:57:35


Strange that your code didn't generate any symbols - is that a mosix thing? Have you tried just adding opal_output (so it goes to a special diagnostic output channel) statements in your code to see where the segfault is occurring?
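For illustration, inserting `opal_output()` calls into the BTL is a quick way to bisect where the crash happens. This is only a sketch: the function name and structure below are hypothetical stand-ins for the real mosix BTL code, not taken from its source.

```c
/* Sketch only: bisecting a segfault with opal_output().
 * mca_btl_mosix_add_procs and rc are hypothetical names; adapt to the
 * actual mosix BTL source. Stream 0 is always open, so these lines
 * print without any extra opal_output_open() setup. */
#include "opal/util/output.h"

static int mca_btl_mosix_add_procs(/* ... actual add_procs signature ... */)
{
    opal_output(0, "mosix btl: entering add_procs");
    /* ... existing add_procs body ... */
    opal_output(0, "mosix btl: leaving add_procs, rc=%d", rc);
    return rc;
}
```

Moving the second call earlier or later through the function narrows down the faulting statement even when no debug symbols are available.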

It looks like you are getting thru orte_init. You could add -mca grpcomm_base_verbose 5 to see if you are getting in/thru the modex - if so, then you are probably failing in add_procs.
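For reference, the verbose run would look something like this (the binary name is taken from the log below; adjust paths as needed):

```
mpirun -n 4 -mca btl self,mosix -mca grpcomm_base_verbose 5 ft.S.4
```

If the grpcomm verbose output shows the modex completing for all ranks, the failure is most likely in the BTL's add_procs path.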

On Apr 25, 2012, at 5:05 AM, Alex Margolin wrote:

> Hi,
>
> I'm getting a segv error off my build of the trunk. I know that my BTL module is responsible ("-mca btl self,tcp" works, "-mca btl self,mosix" fails). Smaller/simpler test applications pass, but NPB doesn't. Can anyone suggest how to proceed with debugging this? My attempts so far include some debug printouts and GDB, whose output appears below. What can I do next?
>
> I'll appreciate any input,
> Alex
>
> alex_at_singularity:~/huji/benchmarks/mpi/npb$ mpirun --debug-daemons -d -n 4 xterm -l -e gdb ft.S.4
> [singularity:07557] procdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/0/0
> [singularity:07557] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/0
> [singularity:07557] top: openmpi-sessions-alex_at_singularity_0
> [singularity:07557] tmp: /tmp
> [singularity:07557] [[44228,0],0] hostfile: checking hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile for nodes
> [singularity:07557] [[44228,0],0] hostfile: filtering nodes through hostfile /home/alex/huji/ompi/etc/openmpi-default-hostfile
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received add_local_procs
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 4
> MPIR_proctable:
> (i, host, exe, pid) = (0, singularity, /usr/bin/xterm, 7558)
> (i, host, exe, pid) = (1, singularity, /usr/bin/xterm, 7559)
> (i, host, exe, pid) = (2, singularity, /usr/bin/xterm, 7560)
> (i, host, exe, pid) = (3, singularity, /usr/bin/xterm, 7561)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> [singularity:07592] procdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1/3
> [singularity:07592] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
> [singularity:07592] top: openmpi-sessions-alex_at_singularity_0
> [singularity:07592] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],3]
> [singularity:07592] [[44228,1],3] decode:nidmap decoding nodemap
> [singularity:07592] [[44228,1],3] decode:nidmap decoding 1 nodes
> [singularity:07592] [[44228,1],3] node[0].name singularity daemon 0
> [singularity:07594] procdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1/1
> [singularity:07594] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
> [singularity:07594] top: openmpi-sessions-alex_at_singularity_0
> [singularity:07594] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],1]
> [singularity:07594] [[44228,1],1] decode:nidmap decoding nodemap
> [singularity:07594] [[44228,1],1] decode:nidmap decoding 1 nodes
> [singularity:07594] [[44228,1],1] node[0].name singularity daemon 0
> [singularity:07596] procdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1/0
> [singularity:07596] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
> [singularity:07596] top: openmpi-sessions-alex_at_singularity_0
> [singularity:07596] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],0]
> [singularity:07596] [[44228,1],0] decode:nidmap decoding nodemap
> [singularity:07596] [[44228,1],0] decode:nidmap decoding 1 nodes
> [singularity:07596] [[44228,1],0] node[0].name singularity daemon 0
> [singularity:07598] procdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1/2
> [singularity:07598] jobdir: /tmp/openmpi-sessions-alex_at_singularity_0/44228/1
> [singularity:07598] top: openmpi-sessions-alex_at_singularity_0
> [singularity:07598] tmp: /tmp
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_SYNC_WANT_NIDMAP
> [singularity:07557] [[44228,0],0] orted_recv: received sync+nidmap from local proc [[44228,1],2]
> [singularity:07598] [[44228,1],2] decode:nidmap decoding nodemap
> [singularity:07598] [[44228,1],2] decode:nidmap decoding 1 nodes
> [singularity:07598] [[44228,1],2] node[0].name singularity daemon 0
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
> [singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering message to job [44228,1] tag 30
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_MESSAGE_LOCAL_PROCS
> [singularity:07557] [[44228,0],0] orted_cmd: received message_local_procs
> [singularity:07557] [[44228,0],0] orted:comm:message_local_procs delivering message to job [44228,1] tag 30
> [singularity:07557] [[44228,0],0]:errmgr_default_hnp.c(418) updating exit status to 1
> [singularity:07557] [[44228,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_EXIT_CMD
> [singularity:07557] [[44228,0],0] orted_cmd: received exit cmd
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> [singularity:07557] [[44228,0],0] orted_cmd: all routes and children gone - exiting
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 2 with PID 7560 on
> node singularity exiting improperly. There are three reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>
> You can avoid this message by specifying -quiet on the mpirun command line.
>
> --------------------------------------------------------------------------
> [singularity:07557] sess_dir_finalize: proc session dir not empty - leaving
> exiting with status 1
> alex_at_singularity:~/huji/benchmarks/mpi/npb$ grep SIGSEGV *
> Xterm.log.singularity.2012.04.24.20.38.03.6992:During startup program terminated with signal SIGSEGV, Segmentation fault.
> Xterm.log.singularity.2012.04.25.13.55.01.7560:During startup program terminated with signal SIGSEGV, Segmentation fault.
> alex_at_singularity:~/huji/benchmarks/mpi/npb$ cat Xterm.log.singularity.2012.04.25.13.55.01.7560
> GNU gdb (Ubuntu/Linaro 7.3-0ubuntu2) 7.3-2011.08
> Copyright (C) 2011 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://bugs.launchpad.net/gdb-linaro/>...
> Reading symbols from /home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4...(no debugging symbols found)...done.
> (gdb) r
> Starting program: /home/alex/huji/benchmarks/mpi/NPB3.3.1/NPB3.3-MPI/bin/ft.S.4
> warning: Error disabling address space randomization: Function not implemented
> During startup program terminated with signal SIGSEGV, Segmentation fault.
> (gdb) l
> No symbol table is loaded. Use the "file" command.
> (gdb) bt
> No stack.
> (gdb) alex_at_singularity:~/huji/benchmarks/mpi/npb$
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel