
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Samuel K. Gutierrez (samuel_at_[hidden])
Date: 2011-02-09 10:59:54


On Feb 8, 2011, at 8:21 PM, Ralph Castain wrote:

> I would personally suggest not reconfiguring your system simply to
> support a particular version of OMPI. The only difference between
> the 1.4 and 1.5 series wrt slurm is that we changed a few things to
> support a more recent version of slurm. It is relatively easy to
> backport that code to the 1.4 series, and it should be (mostly)
> backward compatible.
>
> OMPI is agnostic wrt resource managers. We try to support all
> platforms, with our effort reflective of the needs of our developers
> and their organizations, and our perception of the relative size of
> the user community for a particular platform. Slurm is a fairly
> small community, mostly centered in the three DOE weapons labs, so
> our support for that platform tends to focus on their usage.
>
> So, with that understanding...
>
> Sam: can you confirm that 1.5.1 works on your TLCC machines?

Open MPI 1.5.1 works as expected on our TLCC machines. I also tested
Open MPI 1.4.3 with your SLURM update.

>
> I have created a ticket to upgrade the 1.4.4 release (due out any
> time now) with the 1.5.1 slurm support. Any interested parties can
> follow it here:

Thanks Ralph!

Sam

>
> https://svn.open-mpi.org/trac/ompi/ticket/2717
>
> Ralph
>
>
> On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:
>
>>
>> On 09/02/2011, at 9:16 AM, Ralph Castain wrote:
>>
>>> See below
>>>
>>>
>>> On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:
>>>
>>>>
>>>> On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:
>>>>
>>>>> Hi Michael,
>>>>>
>>>>> You may have tried to send some debug information to the list,
>>>>> but it appears to have been blocked. Compressed text output of
>>>>> the backtrace is sufficient.
>>>>
>>>>
>>>> Odd, I thought I sent it to you directly. In any case, here is
>>>> the backtrace and some information from gdb:
>>>>
>>>> $ salloc -n16 gdb -args mpirun mpi
>>>> (gdb) run
>>>> Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
>>>> [Thread debugging using libthread_db enabled]
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
>>>> 342 pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
>>>> (gdb) bt
>>>> #0 0x00007ffff7b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
>>>> #1 0x00007ffff78a7338 in event_process_active (base=0x615240) at event.c:651
>>>> #2 0x00007ffff78a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
>>>> #3 0x00007ffff78a756f in opal_event_loop (flags=1) at event.c:730
>>>> #4 0x00007ffff789b916 in opal_progress () at runtime/opal_progress.c:189
>>>> #5 0x00007ffff7b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
>>>> #6 0x00007ffff7b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
>>>> #7 0x0000000000403f46 in orterun (argc=2, argv=0x7fffffffe7d8) at orterun.c:754
>>>> #8 0x0000000000402fb4 in main (argc=2, argv=0x7fffffffe7d8) at main.c:13
>>>> (gdb) print pdatorted
>>>> $1 = (orte_proc_t **) 0x67c610
>>>> (gdb) print mev
>>>> $2 = (orte_message_event_t *) 0x681550
>>>> (gdb) print mev->sender.vpid
>>>> $3 = 4294967295
>>>> (gdb) print mev->sender
>>>> $4 = {jobid = 1721696256, vpid = 4294967295}
>>>> (gdb) print *mev
>>>> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x7ffff7dd4f40, obj_reference_count = 1, cls_init_file_name = 0x7ffff7bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}
>>>
>>> The jobid and vpid look like the defined INVALID values,
>>> indicating that something is quite wrong. This would quite likely
>>> lead to the segfault.
>>>
>>> From this, it would indeed appear that you are getting some kind
>>> of library confusion - the most likely cause of such an error is
>>> a daemon from a different version trying to respond, and so the
>>> returned message isn't correct.
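
Just to spell out why that vpid is fatal: 4294967295 is UINT32_MAX, i.e. the
defined INVALID sentinel Ralph refers to, so the assignment at
base/plm_base_launch_support.c:342 indexes pdatorted roughly four billion
entries past the end of the daemon array. A tiny standalone illustration of
the failure mode (not Open MPI code -- the macro and array below are
hypothetical stand-ins for ORTE's sentinel and for pdatorted):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-in for ORTE's invalid-vpid sentinel. */
    #define EXAMPLE_VPID_INVALID UINT32_MAX

    int main(void)
    {
        uint32_t vpid = 4294967295u;  /* the value gdb printed for mev->sender.vpid */
        void *daemons[2] = { 0, 0 };  /* stand-in for pdatorted; num_daemons=2 in frame #5 */

        if (vpid == EXAMPLE_VPID_INVALID) {
            /* Without a check like this, daemons[vpid] dereferences memory
             * far past the end of the array -- the SIGSEGV seen above. */
            fprintf(stderr, "sender vpid is the INVALID sentinel; not indexing\n");
            return 1;
        }

        printf("would update the state of daemon %u of %zu here\n",
               vpid, sizeof(daemons) / sizeof(daemons[0]));
        return 0;
    }
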
>>>
>>> Not sure why else it would be happening... you could try setting
>>> -mca plm_base_verbose 5 to get more debug output displayed on your
>>> screen, assuming you built OMPI with --enable-debug.
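
For anyone following along, with the same allocation and binary path as in
the gdb session above, that would look something like:

    $ salloc -n16 mpirun -mca plm_base_verbose 5 /mnt/f1/michael/home/ServerAdmin/mpi

(as Ralph notes, the extra output assumes OMPI was configured with
--enable-debug).
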
>>>
>>
>> Found the problem.... It is a site configuration issue, which I'll
>> need to find a workaround for.
>>
>> [bio-ipc.{FQDN}:27523] mca:base:select:( plm) Query of component [slurm] set priority to 75
>> [bio-ipc.{FQDN}:27523] mca:base:select:( plm) Selected component [slurm]
>> [bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
>> [bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
>> [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 1936089714
>> [bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
>> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
>> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
>> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
>> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching on nodes ipc3
>> [bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: final top-level argv:
>> srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=ipc3 orted -mca ess slurm -mca orte_ess_jobid 2056716288 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2056716288.0;tcp://lanip:37493;tcp://globalip:37493;tcp://lanip2:37493" -mca plm_base_verbose 20
>>
>> I then inserted some printf's into the ess_slurm_module (rough and
>> ready, I know, but I was in a hurry).
>>
>> Just after initialisation: (at around line 345)
>> orte_ess_slurm: jobid 2056716288 vpid 1
>> So it gets that...
>> I narrowed it down to the get_slurm_nodename function, as execution
>> didn't proceed past that point.
>>
>> line 401:
>>     tmp = strdup(orte_process_info.nodename);
>>     printf( "Our node name == %s\n", tmp );
>> line 409:
>>     for (i=0; NULL != names[i]; i++) {
>>         printf( "Checking %s\n", names[ i ]);
>>
>> Result:
>> Our node name == eng-ipc3.{FQDN}
>> Checking ipc3
>>
>> So it's down to the mismatch between the SLURM node name and the
>> hostname. SLURM really encourages you not to use the fully
>> qualified hostname, and I'd prefer not to have to reconfigure the
>> whole system to use the short name as the hostname. However, since
>> 1.5.1 worked, I backported some of its code -- it uses
>> getenv( "SLURM_NODEID" ) to get the node number, which doesn't
>> rely on an exact string match. Patching this makes things kind of
>> work, but failures still occur during wire-up for more than one node.
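
The 1.5-series behaviour Michael describes boils down to asking SLURM's
environment for the node index instead of matching hostnames. A rough
standalone sketch of that idea (not the actual ess_slurm_module.c code;
error handling is trimmed):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* srun exports the zero-based index of this node within the step,
         * so no comparison of "ipc3" against "eng-ipc3.{FQDN}" is needed. */
        const char *nodeid = getenv("SLURM_NODEID");
        if (NULL == nodeid) {
            fprintf(stderr, "SLURM_NODEID not set; not running under srun\n");
            return 1;
        }
        printf("relative node id: %d\n", atoi(nodeid));
        return 0;
    }
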
>>
>> I think the solution will have to be to change the hostnames on the
>> system to match what slurm+openmpi needs. (Doing this temporarily
>> makes everything work with an unpatched 1.4.3, and the wire-up
>> completes successfully.) Perhaps a note about system hostnames
>> should be added somewhere in the Open MPI / SLURM documentation?
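
One small check also makes the scope of the mismatch clear: trimming the
domain part of the hostname would not have been enough at this site, because
the SLURM name (ipc3) and the unqualified hostname (eng-ipc3) differ as well,
which is why changing the system hostnames (or the environment-based lookup
above) is what actually resolves it. Purely illustrative, with example.com
standing in for the site's {FQDN}:

    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if the two names match up to the first '.' in either. */
    static int short_name_match(const char *a, const char *b)
    {
        size_t la = strcspn(a, ".");
        size_t lb = strcspn(b, ".");
        return la == lb && 0 == strncmp(a, b, la);
    }

    int main(void)
    {
        printf("%d\n", short_name_match("eng-ipc3.example.com", "eng-ipc3")); /* 1 */
        printf("%d\n", short_name_match("eng-ipc3.example.com", "ipc3"));     /* 0 */
        return 0;
    }
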
>>
>> Thank you Ralph & Sam for your help.
>>
>> Cheers,
>> Michael
>>
>>
>>
>>