Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Samuel K. Gutierrez (samuel_at_[hidden])
Date: 2011-02-08 10:17:57


Hi Michael,

You may have tried to send some debug information to the list, but it
appears to have been blocked. Compressed text output of the backtrace
is sufficient.
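
For reference, a minimal sketch of one way to produce and compress such a
backtrace, assuming a core file was left behind by the crashed mpirun (the
paths and file names below are placeholders, not from the original report):

  # enable core dumps before reproducing the crash
  ulimit -c unlimited
  # dump full backtraces for all threads from the core, non-interactively
  gdb -batch -ex "thread apply all bt full" ~/openmpi/bin/mpirun core.31754 > backtrace.txt
  # compress the text before attaching it, since large mails are blocked by the list
  bzip2 backtrace.txt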

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:
> Hi,
>
> A detailed backtrace from a core dump may help us debug this.  Would  
> you be willing to provide that information for us?
>
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
>
>>
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>
>> Hi,
>>
>>> I just tried to reproduce the problem that you are experiencing  
>>> and was unable to.
>>>
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>
>> I compiled Open MPI 1.4.3 (vanilla from source tarball) with the  
>> same platform file (the only change was to re-enable btl-tcp).
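
For context, a build along those lines would look roughly like the following
sketch; the install prefix and make flags are assumptions, and the TCP BTL
re-enable was presumably an edit to the platform file, which is not shown here:

  # configure Open MPI 1.4.3 against the LANL debug platform file mentioned above
  ./configure --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas --prefix=$HOME/openmpi
  make && make install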
>>
>> Unfortunately, the result is the same:
>> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>> salloc: Granted job allocation 145
>>
>> ========================   JOB MAP   ========================
>>
>> Data for node: Name: eng-ipc4.{FQDN}		Num procs: 8
>> 	Process OMPI jobid: [6932,1] Process rank: 0
>> 	Process OMPI jobid: [6932,1] Process rank: 1
>> 	Process OMPI jobid: [6932,1] Process rank: 2
>> 	Process OMPI jobid: [6932,1] Process rank: 3
>> 	Process OMPI jobid: [6932,1] Process rank: 4
>> 	Process OMPI jobid: [6932,1] Process rank: 5
>> 	Process OMPI jobid: [6932,1] Process rank: 6
>> 	Process OMPI jobid: [6932,1] Process rank: 7
>>
>> Data for node: Name: ipc3	Num procs: 8
>> 	Process OMPI jobid: [6932,1] Process rank: 8
>> 	Process OMPI jobid: [6932,1] Process rank: 9
>> 	Process OMPI jobid: [6932,1] Process rank: 10
>> 	Process OMPI jobid: [6932,1] Process rank: 11
>> 	Process OMPI jobid: [6932,1] Process rank: 12
>> 	Process OMPI jobid: [6932,1] Process rank: 13
>> 	Process OMPI jobid: [6932,1] Process rank: 14
>> 	Process OMPI jobid: [6932,1] Process rank: 15
>>
>> =============================================================
>> [eng-ipc4:31754] *** Process received signal ***
>> [eng-ipc4:31754] Signal: Segmentation fault (11)
>> [eng-ipc4:31754] Signal code: Address not mapped (1)
>> [eng-ipc4:31754] Failing at address: 0x8012eb748
>> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
>> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
>> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
>> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
>> [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
>> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
>> [eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
>> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
>> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
>> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
>> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
>> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
>> [eng-ipc4:31754] *** End of error message ***
>> salloc: Relinquishing job allocation 145
>> salloc: Job allocation 145 has been revoked.
>> zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>
>> I've anonymised the paths and domain, otherwise pasted verbatim.   
>> The only odd thing I notice is that the launching machine uses its  
>> full domain name, whereas the other machine is referred to by the  
>> short name.  Despite the FQDN, the domain does not exist in the DNS  
>> (for historical reasons), but does exist in the /etc/hosts file.
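
As an illustration of that naming setup, an /etc/hosts layout consistent with
the description might look like this (the addresses and the example.com domain
are hypothetical; only the short host names appear in the log above):

  # FQDN first, short name as an alias; the domain resolves only via this file
  10.0.0.4   eng-ipc4.example.com   eng-ipc4
  10.0.0.3   ipc3.example.com       ipc3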
>>
>> Any further clues would be appreciated.  In case it may be  
>> relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel  
>> 2.6.32.  One other point of difference may be that our environment  
>> is TCP (Ethernet) based, whereas the LANL test environment is not?
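
One way to make the TCP path explicit in such a setup (a suggested check, not
something run in the original thread; eth0 is a hypothetical interface name)
is to restrict the job to the TCP and self BTLs when launching:

  # restrict Open MPI to the TCP and self (loopback) BTLs and pin TCP to one interface
  salloc -n16 mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 --display-map ~/ServerAdmin/mpi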
>>
>> Michael
>>
>>