Hi Michael,
You may have tried to send some debug information to the list, but it
appears to have been blocked. A compressed text file containing the
backtrace is sufficient.
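
In case it helps, here is a minimal sketch of one way to produce such a
file, assuming gdb is installed and that core dumps were enabled (for
example with `ulimit -c unlimited`) in the shell that launched mpirun.
The helper names, the core file name, and the output file name are
illustrative assumptions, not part of Open MPI or SLURM.

```python
import gzip
import subprocess
from pathlib import Path
from typing import List


def build_gdb_command(executable: str, core_file: str) -> List[str]:
    """Build a non-interactive gdb command that prints a full
    backtrace (with local variables) for every thread in the core."""
    return [
        "gdb",
        "--batch",                          # run the commands, then exit
        "-ex", "thread apply all bt full",  # backtrace for all threads
        executable,
        core_file,
    ]


def write_compressed(text: str, destination: str) -> Path:
    """Write text to a gzip-compressed file and return its path."""
    path = Path(destination)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)
    return path


def collect_backtrace(executable: str, core_file: str,
                      destination: str = "backtrace.txt.gz") -> Path:
    """Run gdb on the core file and save its output, compressed.

    Both stdout and stderr are kept so that gdb warnings (such as
    missing debug symbols) end up in the attachment as well.
    """
    result = subprocess.run(
        build_gdb_command(executable, core_file),
        capture_output=True,
        text=True,
        check=False,
    )
    return write_compressed(result.stdout + result.stderr, destination)
```

Calling `collect_backtrace("/path/to/mpirun", "core")` (both arguments
are placeholders for your own paths) should leave a
`backtrace.txt.gz` in the current directory that can be attached to a
reply.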
Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:
> Hi,
>
> A detailed backtrace from a core dump may help us debug this. Would
> you be willing to provide that information for us?
>
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
>
> On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:
>
>>
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>
>>> Hi,
>>>
>>> I just tried to reproduce the problem that you are experiencing
>>> and was unable to.
>>>
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with:
>>> --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>
>> I compiled Open MPI 1.4.3 (vanilla from the source tarball) with the
>> same platform file (the only change was to re-enable btl-tcp).
>>
>> Unfortunately, the result is the same:
>> salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>> salloc: Granted job allocation 145
>>
>> ======================== JOB MAP ========================
>>
>> Data for node: Name: eng-ipc4.{FQDN} Num procs: 8
>> Process OMPI jobid: [6932,1] Process rank: 0
>> Process OMPI jobid: [6932,1] Process rank: 1
>> Process OMPI jobid: [6932,1] Process rank: 2
>> Process OMPI jobid: [6932,1] Process rank: 3
>> Process OMPI jobid: [6932,1] Process rank: 4
>> Process OMPI jobid: [6932,1] Process rank: 5
>> Process OMPI jobid: [6932,1] Process rank: 6
>> Process OMPI jobid: [6932,1] Process rank: 7
>>
>> Data for node: Name: ipc3 Num procs: 8
>> Process OMPI jobid: [6932,1] Process rank: 8
>> Process OMPI jobid: [6932,1] Process rank: 9
>> Process OMPI jobid: [6932,1] Process rank: 10
>> Process OMPI jobid: [6932,1] Process rank: 11
>> Process OMPI jobid: [6932,1] Process rank: 12
>> Process OMPI jobid: [6932,1] Process rank: 13
>> Process OMPI jobid: [6932,1] Process rank: 14
>> Process OMPI jobid: [6932,1] Process rank: 15
>>
>> =============================================================
>> [eng-ipc4:31754] *** Process received signal ***
>> [eng-ipc4:31754] Signal: Segmentation fault (11)
>> [eng-ipc4:31754] Signal code: Address not mapped (1)
>> [eng-ipc4:31754] Failing at address: 0x8012eb748
>> [eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
>> [eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
>> [eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
>> [eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
>> [eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
>> [eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
>> [eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
>> [eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
>> [eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
>> [eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
>> [eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
>> [eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
>> [eng-ipc4:31754] *** End of error message ***
>> salloc: Relinquishing job allocation 145
>> salloc: Job allocation 145 has been revoked.
>> zsh: exit 1 salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
>>
>> I've anonymised the paths and domain; otherwise the output is pasted
>> verbatim. The only odd thing I notice is that the launching machine
>> uses its fully qualified domain name, whereas the other machine is
>> referred to by its short name. Although the name is fully qualified,
>> the domain does not exist in DNS (for historical reasons), but it
>> does exist in the /etc/hosts file.
>>
>> Any further clues would be appreciated. In case it is relevant, the
>> core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32.
>> Could another difference be that our environment is TCP (Ethernet)
>> based, whereas the LANL test environment is not?
>>
>> Michael
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users