Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] segfault issue - possible bug in openmpi
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-10-06 14:10:10


Yes, there could still be a dependence on the number of processes and
on whether threads are used. But it's not clear from the stack trace
whether this is a threading problem or not (and it is correct that OMPI
v1.2's thread support is non-functional).
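
For what it's worth, a quick way to see what thread support the library
actually grants is to initialize with MPI_Init_thread and print the
"provided" level it hands back. A minimal standalone sketch (not taken
from Daniel's code, just an illustration):

     #include <mpi.h>
     #include <stdio.h>

     int main(int argc, char **argv)
     {
         int provided;

         /* Ask for the highest thread level and see what the library grants. */
         MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

         printf("requested MPI_THREAD_MULTIPLE, provided level = %d "
                "(SINGLE=%d FUNNELED=%d SERIALIZED=%d MULTIPLE=%d)\n",
                provided, MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
                MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE);

         MPI_Finalize();
         return 0;
     }

With the v1.2 series, anything above MPI_THREAD_SINGLE should be treated
with suspicion regardless of what is reported.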

As for more information that would help diagnose the problem, please see

     http://www.open-mpi.org/community/help/

Thanks.

On Oct 4, 2008, at 9:28 PM, Doug Reeder wrote:

> Shafagh,
>
> I missed the dependence on the number of processors. Apparently
> there is some thread support.
>
> Doug
> On Oct 4, 2008, at 5:29 PM, Shafagh Jafer wrote:
>
>> Doug Reeder,
>> Daniel is saying that the problem only occurs in openmpi when
>> running more than 16 processes. So could that still be caused by
>> openmpi not supporting threads?
>>
>> --- On Fri, 10/3/08, Doug Reeder <dlr_at_[hidden]> wrote:
>> From: Doug Reeder <dlr_at_[hidden]>
>> Subject: Re: [OMPI users] segfault issue - possible bug in openmpi
>> To: "Open MPI Users" <users_at_[hidden]>
>> Date: Friday, October 3, 2008, 2:40 PM
>>
>> Daniel,
>>
>> Are you using threads? I don't think openmpi-1.2.x works with
>> threads.
>>
>> Doug Reeder
>> On Oct 3, 2008, at 2:30 PM, Daniel Hansen wrote:
>>
>>> Oh, by the way, here is the segfault:
>>>
>>> [m4b-1-8:11481] *** Process received signal ***
>>> [m4b-1-8:11481] Signal: Segmentation fault (11)
>>> [m4b-1-8:11481] Signal code: Address not mapped (1)
>>> [m4b-1-8:11481] Failing at address: 0x2b91c69eed
>>> [m4b-1-8:11483] [ 0] /lib64/libpthread.so.0 [0x33e8c0de70]
>>> [m4b-1-8:11483] [ 1] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2aaaaabea7c0]
>>> [m4b-1-8:11483] [ 2] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2aaaaabea675]
>>> [m4b-1-8:11483] [ 3] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(mca_pml_ob1_send+0x2da) [0x2aaaaabeaf55]
>>> [m4b-1-8:11483] [ 4] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(MPI_Send+0x28e) [0x2aaaaab52c5a]
>>> [m4b-1-8:11483] [ 5] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(twham_init+0x708) [0x42a8a8]
>>> [m4b-1-8:11483] [ 6] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(repexch+0x73c) [0x425d5c]
>>> [m4b-1-8:11483] [ 7] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(main+0x855) [0x4133a5]
>>> [m4b-1-8:11483] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33e841d8a4]
>>> [m4b-1-8:11483] [ 9] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham [0x4040b9]
>>> [m4b-1-8:11483] *** End of error message ***
>>>
>>>
>>>
>>> On Fri, Oct 3, 2008 at 3:20 PM, Daniel Hansen <dhansen_at_[hidden]>
>>> wrote:
>>> I have been testing some code against openmpi lately that always
>>> causes it to crash during certain mpi function calls. The code
>>> does not seem to be the problem, as it runs just fine against
>>> mpich. I have tested it against openmpi 1.2.5, 1.2.6, and 1.2.7
>>> and they all exhibit the same problem. Also, the problem only
>>> occurs in openmpi when running more than 16 processes. I have
>>> posted this stack trace to the list before, but I am submitting it
>>> now as a potential bug report. I need some help debugging it and
>>> finding out exactly what is going on in openmpi when the segfault
>>> occurs. Are there any suggestions on how best to do this? Is
>>> there an easy way to attach gdb to one of the processes or
>>> something?? I have already compiled openmpi with debugging,
>>> memory profiling, etc. How can I best take advantage of these
>>> features?
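
One low-tech way to attach gdb that works even under a batch scheduler:
have each rank print its PID and host name and then spin until a
debugger is attached and releases it. A rough sketch (the helper name
is made up; calling it early in main(), or just before twham_init where
the trace points, is one option):

     #include <stdio.h>
     #include <unistd.h>

     /* Hypothetical helper -- not from the application's actual code. */
     static void wait_for_debugger(void)
     {
         volatile int holding = 1;
         char hostname[256];

         gethostname(hostname, sizeof(hostname));
         printf("PID %d on %s waiting for gdb attach\n",
                (int)getpid(), hostname);
         fflush(stdout);

         /* From gdb: attach the PID, "set var holding = 0", then "continue". */
         while (holding)
             sleep(5);
     }

Since the build already has debugging symbols, walking up from the
attach point through MPI_Send should show the arguments in play when
the segfault occurs. An alternative is to have mpirun launch each
process under gdb in its own xterm, but that gets unwieldy for 16+
processes on a cluster.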
>>>
>>> Thanks,
>>> Daniel Hansen
>>> Systems Administrator
>>> BYU Fulton Supercomputing Lab
>>>

-- 
Jeff Squyres
Cisco Systems