
Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Samuel K. Gutierrez (samuel_at_[hidden])
Date: 2011-02-07 10:38:25


Hi,

A detailed backtrace from a core dump may help us debug this. Would
you be willing to provide that information for us?
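
If core files are enabled on the compute nodes (ulimit -c unlimited
before launching), something along these lines should get us what we
need; the core file name below is just an example and will vary by
system:

    gdb ~/../openmpi/bin/mpirun core.31754
    (gdb) thread apply all bt full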

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
>
> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>
> Hi,
>
>> I just tried to reproduce the problem that you are experiencing and  
>> was unable to.
>>
>> SLURM 2.1.15
>> Open MPI 1.4.3 configured with:
>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>
> I compiled Open MPI 1.4.3 (vanilla from source tarball) with the same
> platform file (the only change was to re-enable btl-tcp).
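
For reference, building against that platform file amounts to a
configure invocation roughly like the one below; the --prefix here is
illustrative, and "re-enabling btl-tcp" would be an edit to the
platform file itself rather than a configure flag:

    ./configure --prefix=$HOME/openmpi \
        --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas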
>
> Unfortunately, the result is the same:
> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
> salloc: Granted job allocation 145
>
> ========================   JOB MAP   ========================
>
> Data for node: Name: eng-ipc4.{FQDN}		Num procs: 8
> 	Process OMPI jobid: [6932,1] Process rank: 0
> 	Process OMPI jobid: [6932,1] Process rank: 1
> 	Process OMPI jobid: [6932,1] Process rank: 2
> 	Process OMPI jobid: [6932,1] Process rank: 3
> 	Process OMPI jobid: [6932,1] Process rank: 4
> 	Process OMPI jobid: [6932,1] Process rank: 5
> 	Process OMPI jobid: [6932,1] Process rank: 6
> 	Process OMPI jobid: [6932,1] Process rank: 7
>
> Data for node: Name: ipc3	Num procs: 8
> 	Process OMPI jobid: [6932,1] Process rank: 8
> 	Process OMPI jobid: [6932,1] Process rank: 9
> 	Process OMPI jobid: [6932,1] Process rank: 10
> 	Process OMPI jobid: [6932,1] Process rank: 11
> 	Process OMPI jobid: [6932,1] Process rank: 12
> 	Process OMPI jobid: [6932,1] Process rank: 13
> 	Process OMPI jobid: [6932,1] Process rank: 14
> 	Process OMPI jobid: [6932,1] Process rank: 15
>
> =============================================================
> [eng-ipc4:31754] *** Process received signal ***
> [eng-ipc4:31754] Signal: Segmentation fault (11)
> [eng-ipc4:31754] Signal code: Address not mapped (1)
> [eng-ipc4:31754] Failing at address: 0x8012eb748
> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
> [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
> [eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
> [eng-ipc4:31754] *** End of error message ***
> salloc: Relinquishing job allocation 145
> salloc: Job allocation 145 has been revoked.
> zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
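
An aside on reading the trace: frames such as [ 1] and [ 7] only show
load-address-relative offsets like (+0x7f869). For a shared library
those offsets can usually be fed straight to addr2line to recover the
file and line, provided the build kept debug symbols (the debug
platform file should have):

    addr2line -f -e ~/../openmpi/lib/libopen-rte.so.0 0x7f869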
>
> I've anonymised the paths and domain, otherwise pasted verbatim.   
> The only odd thing I notice is that the launching machine uses its  
> full domain name, whereas the other machine is referred to by the  
> short name.  Despite the FQDN, the domain does not exist in the DNS  
> (for historical reasons), but does exist in the /etc/hosts file.
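
That asymmetry is worth pinning down: every node should resolve both
the long and the short form of each hostname to the same address. An
/etc/hosts entry along these lines (the address and domain here are
made up) covers both forms:

    10.1.0.4    eng-ipc4.example.com    eng-ipc4

A launch node and a remote node disagreeing about a host's canonical
name is the kind of mismatch that can confuse the runtime during
daemon startup.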
>
> Any further clues would be appreciated.  In case it may be relevant,
> core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.  One
> other point of difference may be that our environment is TCP
> (Ethernet) based, whereas the LANL test environment perhaps is not?
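
On the TCP question: you can take interconnect selection out of the
picture by forcing the TCP and self transports explicitly, e.g.:

    salloc -n16 ~/../openmpi/bin/mpirun --mca btl tcp,self \
        --display-map ~/ServerAdmin/mpi

That said, the backtrace shows mpirun dying in
orte_plm_base_daemon_callback, i.e. while waiting for the remote
daemons to report in and before any MPI traffic flows, so the BTL in
use is unlikely to be the culprit.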
>
> Michael