
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads
From: Brian Barrett (brbarret_at_[hidden])
Date: 2009-06-11 12:25:05


Neither the CM PML nor the MX MTL has been examined for thread
safety. There's not much code in the CM PML to cause problems. The
MX MTL would likely need some work to ensure the restrictions Scott
mentioned are met (currently, there's no such guarantee in the MX MTL).

Brian

On Jun 11, 2009, at 10:21 AM, George Bosilca wrote:

> The comment in the FAQ (and on the other thread) is only true for
> some BTLs (TCP, SM, and MX). I don't have the resources to test the
> other BTLs; it is their developers' responsibility to make the
> modifications required for thread safety.
>
> In addition, I have to confess that I never tested the MTLs for
> thread safety. They are a completely different implementation of
> message passing, supposed to map directly onto the underlying
> network's capabilities. However, there are clearly a few places where
> thread safety should be enforced in the MTL layer, and I don't know
> whether that is the case.
>
> george.
>
> On Jun 11, 2009, at 09:35 , Scott Atchley wrote:
>
>> Francois,
>>
>> For threads, the FAQ has:
>>
>> http://www.open-mpi.org/faq/?category=supported-systems#thread-support
>>
>> It mentions that thread support is designed in, but lightly tested.
>> It is also possible that the FAQ is out of date and
>> MPI_THREAD_MULTIPLE is fully supported.
>>
>> The stack trace below shows:
>>
>> opal_free()
>> opal_progress()
>> MPI_Recv()
>>
>> I do not know this code, but the problem may be in the higher-level
>> code that calls the BTLs and/or MTLs; that would be the place to
>> check whether the TCP BTL is handled differently from the MX BTL/MTL.
>>
>> MX is thread safe, with the caveat that two threads may not try to
>> complete the same request at the same time. This includes calling
>> mx_test(), mx_wait(), mx_test_any(), and/or mx_wait_any(), where the
>> latter two take match bits and a match mask that could complete a
>> request being tested/waited on by another thread.
>>
>> Scott
>>
>> On Jun 11, 2009, at 6:00 AM, François Trahay wrote:
>>
>>> Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php
>>> ), threads are supported in Open MPI.
>>> The program I am trying to run works with the TCP stack, and the MX
>>> driver is thread-safe, so I guess the problem comes from the MX BTL
>>> or MTL.
>>>
>>> Francois
>>>
>>>
>>> Scott Atchley wrote:
>>>> Hi Francois,
>>>>
>>>> I am not familiar with the internals of the OMPI code. Are you
>>>> sure, however, that threads are fully supported yet? I was under
>>>> the impression that thread support was still partial.
>>>>
>>>> Can anyone else comment?
>>>>
>>>> Scott
>>>>
>>>> On Jun 8, 2009, at 8:43 AM, François Trahay wrote:
>>>>
>>>>> Hi,
>>>>> I'm encountering some issues when running a multithreaded program
>>>>> with Open MPI (trunk rev. 21380, configured with --enable-mpi-threads).
>>>>> My program (included in the tar.bz2) uses several pthreads that
>>>>> perform ping pongs concurrently (thread #1 uses tag #1, thread #2
>>>>> uses tag #2, etc.).
>>>>> This program crashes over MX (either BTL or MTL) with the
>>>>> following backtrace:
>>>>>
>>>>> concurrent_ping_v2: pml_cm_recvreq.c:53: mca_pml_cm_recv_request_completion:
>>>>> Assertion `0 == ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete' failed.
>>>>> [joe0:01709] *** Process received signal ***
>>>>> [joe0:01709] *** Process received signal ***
>>>>> [joe0:01709] Signal: Segmentation fault (11)
>>>>> [joe0:01709] Signal code: Address not mapped (1)
>>>>> [joe0:01709] Failing at address: 0x1238949c4
>>>>> [joe0:01709] Signal: Aborted (6)
>>>>> [joe0:01709] Signal code: (-6)
>>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>>> [joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
>>>>> [joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
>>>>> [joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
>>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>>> [joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
>>>>> [joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
>>>>> [joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
>>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
>>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
>>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>>> [joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>>> [joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>>> [joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>>> [joe0:01709] *** End of error message ***
>>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208120bb]
>>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b80a]
>>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081147a]
>>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>>> [joe0:01709] [10] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>>> [joe0:01709] [11] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>>> [joe0:01709] [12] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>>> [joe0:01709] [13] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>>> [joe0:01709] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 1 with PID 1709 on node joe0
>>>>> exited on signal 6 (Aborted).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> Any idea?
>>>>>
>>>>> Francois Trahay
>>>>>
>>>>> <bug-report.tar.bz2>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>