Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Michael Curtis (michael.curtis_at_[hidden])
Date: 2011-01-28 02:16:04


On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:

> Failing case:
> michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi
> ======================== JOB MAP ========================

Backtrace with debugging symbols:
#0 0x00007ffff7bb5c1e in ?? () from /usr/lib/libopen-rte.so.0
#1 0x00007ffff792e23f in ?? () from /usr/lib/libopen-pal.so.0
#2 0x00007ffff7920679 in opal_progress () from /usr/lib/libopen-pal.so.0
#3 0x00007ffff7bb6e5d in orte_plm_base_daemon_callback () from /usr/lib/libopen-rte.so.0
#4 0x00007ffff62b67e7 in plm_slurm_launch_job (jdata=<value optimised out>) at ../../../../../../orte/mca/plm/slurm/plm_slurm_module.c:360
#5 0x00000000004041c8 in orterun (argc=4, argv=0x7fffffffe7d8) at ../../../../../orte/tools/orterun/orterun.c:754
#6 0x0000000000403234 in main (argc=4, argv=0x7fffffffe7d8) at ../../../../../orte/tools/orterun/main.c:13

Trace output with -d100 and --enable-trace:
[:10821] progressed_wait: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 459
[:10821] defining message event: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 423

Judging from this, I'd guess it is crashing in the event loop, possibly in:
        static void process_orted_launch_report(int fd, short event, void *data)
 
strace:
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 1000) = 1 ([{fd=13, revents=POLLIN}])
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\4\0\0\0\232"..., 36}], 1) = 36
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\n\0\0\0\1\0\0\0u1390"..., 154}], 1) = 154
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 0) = 0 (Timeout)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

OK, I matched the disassemblies and confirmed that the crash originates in process_orted_launch_report. Matching the source against the program counter gdb reports at the time of the crash points to this line:

    /* update state */
    pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
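A statement of this shape can fault in two ways: the index can run past the end of the pointer array, or the slot it selects can be NULL. Which of these (if either) applies here is not established by the traces above. The following is only a minimal C sketch of a guarded version of such an update; proc_t, proc_state_t, mark_daemon_running and the num_daemons argument are hypothetical stand-ins, not the ORTE definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ORTE process types; NOT the real ones. */
typedef enum { PROC_STATE_UNDEF = 0, PROC_STATE_RUNNING = 1 } proc_state_t;

typedef struct {
    proc_state_t state;
} proc_t;

/*
 * Guarded form of "table[vpid]->state = RUNNING".
 * Returns 0 on success, -1 if vpid is out of range or the slot is NULL.
 * num_daemons is an assumed parameter giving the length of the table.
 */
int mark_daemon_running(proc_t **daemons, size_t num_daemons,
                        unsigned int vpid)
{
    assert(daemons != NULL);        /* the table itself must exist */

    if (vpid >= num_daemons) {
        return -1;                  /* index past the end of the table */
    }
    if (daemons[vpid] == NULL) {
        return -1;                  /* no entry recorded for this vpid */
    }

    daemons[vpid]->state = PROC_STATE_RUNNING;
    return 0;
}
```

Checking the return value at the call site would turn either condition into a reportable error rather than a SIGSEGV.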

Hopefully all this information helps a little!