Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-02-08 17:16:05


See below

On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

>
> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>
>> Hi Michael,
>>
>> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.
>
>
> Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:
>
> $ salloc -n16 gdb -args mpirun mpi
> (gdb) run
> Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
> [Thread debugging using libthread_db enabled]
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> 342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) bt
> #0 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> #1 0x00007ffff78a7338 in event_process_active (base=0x615240) at event.c:651
> #2 0x00007ffff78a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
> #3 0x00007ffff78a756f in opal_event_loop (flags=1) at event.c:730
> #4 0x00007ffff789b916 in opal_progress () at runtime/opal_progress.c:189
> #5 0x00007ffff7b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
> #6 0x00007ffff7b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
> #7 0x0000000000403f46 in orterun (argc=2, argv=0x7fffffffe7d8) at orterun.c:754
> #8 0x0000000000402fb4 in main (argc=2, argv=0x7fffffffe7d8) at main.c:13
> (gdb) print pdatorted
> $1 = (orte_proc_t **) 0x67c610
> (gdb) print mev
> $2 = (orte_message_event_t *) 0x681550
> (gdb) print mev->sender.vpid
> $3 = 4294967295
> (gdb) print mev->sender
> $4 = {jobid = 1721696256, vpid = 4294967295}
> (gdb) print *mev
> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7ffff7dd4f40, obj_reference_count = 1, cls_init_file_name = 0x7ffff7bb9a78 "base/plm_base_launch_support.c",
> cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.
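That vpid is (uint32_t)-1, so the index on line 342 runs billions of entries past the end of pdatorted. Purely as an illustration of the failure mode (simplified stand-in types, not the real ORTE declarations):

  /* toy reproduction of pdatorted[mev->sender.vpid]->state = ... */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct { int state; } proc_t;       /* stand-in for orte_proc_t */

  int main(void)
  {
      /* one slot per daemon, like the pdatorted array */
      proc_t **pdatorted = calloc(2, sizeof(proc_t *));
      pdatorted[0] = calloc(1, sizeof(proc_t));
      pdatorted[1] = calloc(1, sizeof(proc_t));

      uint32_t vpid = (uint32_t)-1;           /* the INVALID sentinel: 4294967295 */
      printf("vpid = %u\n", vpid);

      pdatorted[vpid]->state = 1;             /* wild pointer dereference -> SIGSEGV */
      return 0;
  }

So the crash itself is just a symptom - the real question is why the sender name in that message is invalid.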

From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.
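One quick way to check for that (assuming mpirun and orted are on the default PATH of both the head node and the compute nodes) is to compare what each side actually resolves:

  $ which mpirun; mpirun --version          # install used on the head node
  $ srun -n1 which orted                    # daemon SLURM would launch on a compute node
  $ srun -n1 sh -c 'ldd $(which orted)'     # shared libraries that daemon picks up

If the paths or library versions differ between the two, that mismatch would explain the garbled launch report.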

Not sure why else it would be happening...you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.
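For example, reusing your earlier invocation (exact output varies by version, but it should show the daemon launch/callback handshake):

  $ salloc -n16 mpirun -mca plm_base_verbose 5 /mnt/f1/michael/home/ServerAdmin/mpi

In particular, watch for the name each orted reports back with - that should tell us whether the bad sender name is coming off the wire or being mangled on receipt.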

>
> That vpid looks suspiciously like -1.
>
> Further debugging:
> Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffffffe170, buffer=0x7ffff7b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
> 411 {
> (gdb) print sender
> $2 = (orte_process_name_t *) 0x7fffffffe170
> (gdb) print *sender
> $3 = {jobid = 6822016, vpid = 0}
> (gdb) continue
> Continuing.
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
> 342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) print mev->sender
> $4 = {jobid = 1778450432, vpid = 4294967295}
>
> The daemon probably died as I spent too long thinking about my gdb input ;)

I'm not sure why that would happen - there are no timers in the system, so it won't care how long it takes to initialize. I'm guessing this is another indicator of a library issue.

>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users