Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-06-11 15:54:46


Based on the stack trace, at one point (depth 4) we are in the MX MTL
and then we call free. It might happen that two threads call free
simultaneously ... This is only a guess, as there is not enough
information to corroborate it.

   george.

On Jun 11, 2009, at 13:17 , Scott Atchley wrote:

> Brian and George,
>
> I do not know if the stack trace is complete, but I do not see any
> mx_* functions called which would indicate a crash inside MX due to
> multiple threads trying to complete the same request. It does show
> an assert failed.
>
> Francois, is the stack trace from the MX MTL or BTL? Can you send a
> small program that reproduces this abort?
>
> Scott
>
>
> On Jun 11, 2009, at 12:25 PM, Brian Barrett wrote:
>
>> Neither the CM PML nor the MX MTL has been looked at for thread
>> safety. There's not much code to cause problems in the CM PML.
>> The MX MTL would likely need some work to ensure the restrictions
>> Scott mentioned are met (currently, there's no such guarantee in
>> the MX MTL).
>>
>> Brian
>>
>> On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:
>>
>>> The comment on the FAQ (and on the other thread) is only true for
>>> some BTLs (TCP, SM and MX). I don't have resources to test the
>>> other BTLs; it is their developers' responsibility to make the
>>> required modifications to make them thread safe.
>>>
>>> In addition, I have to confess that I never tested the MTL for
>>> thread safety. It is a completely different implementation of
>>> message passing, supposed to map directly on top of the
>>> underlying network capabilities. However, there are clearly a few
>>> places where thread safety should be enforced in the MTL layer,
>>> and I don't know if this is the case.
>>>
>>> george.
>>>
>>> On Jun 11, 2009, at 09:35 , Scott Atchley wrote:
>>>
>>>> Francois,
>>>>
>>>> For threads, the FAQ has:
>>>>
>>>> http://www.open-mpi.org/faq/?category=supported-systems#thread-support
>>>>
>>>> It mentions that thread support is designed in, but lightly
>>>> tested. It is also possible that the FAQ is out of date and
>>>> MPI_THREAD_MULTIPLE is fully supported.
>>>>
>>>> The stack trace below shows:
>>>>
>>>> opal_free()
>>>> opal_progress()
>>>> MPI_Recv()
>>>>
>>>> I do not know this code, but it may be in the higher level code
>>>> that calls the BTLs and/or MTLs and it would be a place to see if
>>>> that code handles the TCP BTL differently than MX BTL/MTL.
>>>>
>>>> MX is thread safe with the caveat that two threads may not try to
>>>> complete the same request at the same time. This includes calling
>>>> mx_test(), mx_wait(), mx_test_any() and/or mx_wait_any() where
>>>> the latter two have match bits and match mask that could complete
>>>> a request being tested/waited by another thread.
>>>>
>>>> Scott
>>>>
>>>> On Jun 11, 2009, at 6:00 AM, François Trahay wrote:
>>>>
>>>>> Well, according to George Bosilca
>>>>> (http://www.open-mpi.org/community/lists/users/2005/02/0005.php),
>>>>> threads are supported in OpenMPI.
>>>>> The program I am trying to run works with the TCP stack, and the
>>>>> MX driver is thread-safe, so I guess the problem comes from the
>>>>> MX BTL or MTL.
>>>>>
>>>>> Francois
>>>>>
>>>>>
>>>>> Scott Atchley wrote:
>>>>>> Hi Francois,
>>>>>>
>>>>>> I am not familiar with the internals of the OMPI code. Are you
>>>>>> sure, however, that threads are fully supported yet? I was
>>>>>> under the impression that thread support was still partial.
>>>>>>
>>>>>> Can anyone else comment?
>>>>>>
>>>>>> Scott
>>>>>>
>>>>>> On Jun 8, 2009, at 8:43 AM, François Trahay wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I'm encountering some issues when running a multithreaded
>>>>>>> program with OpenMPI (trunk rev. 21380, configured with
>>>>>>> --enable-mpi-threads).
>>>>>>> My program (included in the tar.bz2) uses several pthreads
>>>>>>> that perform ping-pongs concurrently (thread #1 uses tag #1,
>>>>>>> thread #2 uses tag #2, etc.).
>>>>>>> This program crashes over MX (either BTL or MTL) with the
>>>>>>> following backtrace:
>>>>>>>
>>>>>>> concurrent_ping_v2: pml_cm_recvreq.c:53: mca_pml_cm_recv_request_completion: Assertion `0 == ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete' failed.
>>>>>>> [joe0:01709] *** Process received signal ***
>>>>>>> [joe0:01709] *** Process received signal ***
>>>>>>> [joe0:01709] Signal: Segmentation fault (11)
>>>>>>> [joe0:01709] Signal code: Address not mapped (1)
>>>>>>> [joe0:01709] Failing at address: 0x1238949c4
>>>>>>> [joe0:01709] Signal: Aborted (6)
>>>>>>> [joe0:01709] Signal code: (-6)
>>>>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>>>>> [joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
>>>>>>> [joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
>>>>>>> [joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
>>>>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>>>>> [joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
>>>>>>> [joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
>>>>>>> [joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
>>>>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
>>>>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
>>>>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>>>>> [joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>>>>> [joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>>>>> [joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>>>>> [joe0:01709] *** End of error message ***
>>>>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208120bb]
>>>>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b80a]
>>>>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081147a]
>>>>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>>>>> [joe0:01709] [10] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>>>>> [joe0:01709] [11] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>>>>> [joe0:01709] [12] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>>>>> [joe0:01709] [13] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>>>>> [joe0:01709] *** End of error message ***
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 1 with PID 1709 on node joe0 exited on signal 6 (Aborted).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Francois Trahay
>>>>>>>
>>>>>>> <bug-report.tar.bz2>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users