Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-02-08 10:38:44


Another possibility to check: are you sure you are getting the same OMPI version on the backend nodes? When I see it work on the local node but fail multi-node, the most common cause is that you are picking up a different OMPI version due to path differences on the backend nodes.
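
A quick sanity check (the command below is only a sketch; it assumes the SLURM allocation is active and that ompi_info and orted are found through the default PATH on each node) is something like:

  srun -N 2 bash -c 'hostname; which orted; ompi_info | grep "Open MPI:"'

Every node should report the same install prefix and the same 1.4.3 version.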

On Feb 8, 2011, at 8:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace is sufficient.
>
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:
>
>> Hi,
>>
>> A detailed backtrace from a core dump may help us debug this. Would you be willing to provide that information for us?
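>>
>> If it helps, a typical way to capture one (the paths and core-file name below are placeholders, and this assumes core dumps were enabled, e.g. with "ulimit -c unlimited", before the run) is:
>>
>>   gdb ~/../openmpi/bin/mpirun core.<pid>
>>   (gdb) bt full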
>>
>> Thanks,
>>
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>> On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
>>
>>>
>>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>>
>>> Hi,
>>>
>>>> I just tried to reproduce the problem that you are experiencing and was unable to.
>>>>
>>>> SLURM 2.1.15
>>>> Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>>
>>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).
>>>
>>> Unfortunately, the result is the same:
>>> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>> salloc: Granted job allocation 145
>>>
>>> ======================== JOB MAP ========================
>>>
>>> Data for node: Name: eng-ipc4.{FQDN} Num procs: 8
>>> Process OMPI jobid: [6932,1] Process rank: 0
>>> Process OMPI jobid: [6932,1] Process rank: 1
>>> Process OMPI jobid: [6932,1] Process rank: 2
>>> Process OMPI jobid: [6932,1] Process rank: 3
>>> Process OMPI jobid: [6932,1] Process rank: 4
>>> Process OMPI jobid: [6932,1] Process rank: 5
>>> Process OMPI jobid: [6932,1] Process rank: 6
>>> Process OMPI jobid: [6932,1] Process rank: 7
>>>
>>> Data for node: Name: ipc3 Num procs: 8
>>> Process OMPI jobid: [6932,1] Process rank: 8
>>> Process OMPI jobid: [6932,1] Process rank: 9
>>> Process OMPI jobid: [6932,1] Process rank: 10
>>> Process OMPI jobid: [6932,1] Process rank: 11
>>> Process OMPI jobid: [6932,1] Process rank: 12
>>> Process OMPI jobid: [6932,1] Process rank: 13
>>> Process OMPI jobid: [6932,1] Process rank: 14
>>> Process OMPI jobid: [6932,1] Process rank: 15
>>>
>>> =============================================================
>>> [eng-ipc4:31754] *** Process received signal ***
>>> [eng-ipc4:31754] Signal: Segmentation fault (11)
>>> [eng-ipc4:31754] Signal code: Address not mapped (1)
>>> [eng-ipc4:31754] Failing at address: 0x8012eb748
>>> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
>>> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
>>> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
>>> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
>>> [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
>>> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
>>> [eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
>>> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
>>> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
>>> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
>>> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
>>> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
>>> [eng-ipc4:31754] *** End of error message ***
>>> salloc: Relinquishing job allocation 145
>>> salloc: Job allocation 145 has been revoked.
>>> zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>>
>>> I've anonymised the paths and domain; otherwise the output is pasted verbatim. The only odd thing I notice is that the launching machine is reported by its full domain name, whereas the other machine is referred to by its short name. Despite the FQDN, the domain does not exist in DNS (for historical reasons), but it does exist in the /etc/hosts file.
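>>>
>>> (A quick way to compare what each side resolves, assuming getent is available, is to run something like the following on both the launch node and a backend node; the hostnames here are the anonymised ones from the output above:
>>>
>>>   getent hosts eng-ipc4 eng-ipc4.{FQDN}
>>>   hostname --fqdn
>>> )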
>>>
>>> Any further clues would be appreciated. In case it is relevant, core system versions are glibc 2.11, gcc 4.4.3, and kernel 2.6.32. Could one other point of difference be that our environment is TCP (Ethernet) based, whereas the LANL test environment is not?
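>>>
>>> (For reference, one way to take transport selection out of the equation would be to pin the run explicitly to the TCP BTL via the standard MCA parameter, e.g.:
>>>
>>>   salloc -n16 ~/../openmpi/bin/mpirun --mca btl tcp,self --display-map ~/ServerAdmin/mpi
>>>
>>> The paths are the anonymised ones used above.)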
>>>
>>> Michael
>>>