Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Michael Curtis (michael.curtis_at_[hidden])
Date: 2011-02-08 16:44:21


On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.

Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:

$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) bt
#0 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1 0x00007ffff78a7338 in event_process_active (base=0x615240) at event.c:651
#2 0x00007ffff78a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
#3 0x00007ffff78a756f in opal_event_loop (flags=1) at event.c:730
#4 0x00007ffff789b916 in opal_progress () at runtime/opal_progress.c:189
#5 0x00007ffff7b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
#6 0x00007ffff7b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
#7 0x0000000000403f46 in orterun (argc=2, argv=0x7fffffffe7d8) at orterun.c:754
#8 0x0000000000402fb4 in main (argc=2, argv=0x7fffffffe7d8) at main.c:13
(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7ffff7dd4f40, obj_reference_count = 1, cls_init_file_name = 0x7ffff7bb9a78 "base/plm_base_launch_support.c",
   cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

That vpid looks suspiciously like -1.

Further debugging:
Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffffffe170, buffer=0x7ffff7b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
411 {
(gdb) print sender
$2 = (orte_process_name_t *) 0x7fffffffe170
(gdb) print *sender
$3 = {jobid = 6822016, vpid = 0}
(gdb) continue
Continuing.
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) print mev->sender
$4 = {jobid = 1778450432, vpid = 4294967295}

The daemon probably died as I spent too long thinking about my gdb input ;)