Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-06-11 12:21:54


The comment in the FAQ (and in the other thread) is only true for some
BTLs (TCP, SM and MX). I don't have the resources to test the other
BTLs; it is their developers' responsibility to make the modifications
required to make them thread safe.

In addition, I have to confess that I never tested the MTL for thread
safety. It is a completely different implementation of message
passing, meant to map directly onto the underlying network
capabilities. However, there are clearly a few places where thread
safety should be enforced in the MTL layer, and I don't know whether
it is.

   george.
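
For illustration, the caveat Scott describes in the quoted message below
(two threads may not try to complete the same request at the same time)
can be sketched with a claim-once guard. This is a minimal sketch under
stated assumptions, not code from Open MPI or MX: demo_request_t,
demo_try_complete(), demo_worker() and demo_race() are hypothetical
names, and an atomic flag is only one possible way to serialize
completion.

```c
/* Illustrative only: none of these names exist in Open MPI or MX. */
#include <pthread.h>
#include <stdatomic.h>

#define DEMO_NUM_THREADS 4

/* Hypothetical request: "claimed" stays 0 until one thread takes it. */
typedef struct {
    atomic_int claimed;
    int completions; /* written only by the thread that won the claim */
} demo_request_t;

/* Returns 1 if the caller completed the request, 0 if another thread
 * had already claimed it. The compare-and-swap lets exactly one caller
 * move "claimed" from 0 to 1. */
static int demo_try_complete(demo_request_t *req)
{
    int expected = 0;
    if (!atomic_compare_exchange_strong(&req->claimed, &expected, 1)) {
        return 0; /* lost the race: do not touch the request */
    }
    req->completions++;
    return 1;
}

static void *demo_worker(void *arg)
{
    demo_try_complete((demo_request_t *)arg);
    return NULL;
}

/* Races several threads on one request and returns how many times the
 * request was completed. */
static int demo_race(void)
{
    demo_request_t req;
    pthread_t threads[DEMO_NUM_THREADS];

    atomic_init(&req.claimed, 0);
    req.completions = 0;

    for (int i = 0; i < DEMO_NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, demo_worker, &req);
    }
    for (int i = 0; i < DEMO_NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return req.completions;
}
```

Calling demo_race() from a main() built with -pthread should report a
single completion no matter how the threads interleave. Whether the
crash reported in this thread is actually a double completion of this
kind is an assumption, not something established here.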

On Jun 11, 2009, at 09:35 , Scott Atchley wrote:

> Francois,
>
> For threads, the FAQ has:
>
> http://www.open-mpi.org/faq/?category=supported-systems#thread-support
>
> It mentions that thread support is designed in, but lightly tested.
> It is also possible that the FAQ is out of date and
> MPI_THREAD_MULTIPLE is fully supported.
>
> The stack trace below shows:
>
> opal_free()
> opal_progress()
> MPI_Recv()
>
> I do not know this code, but the problem may be in the higher-level
> code that calls the BTLs and/or MTLs. That would be a place to check
> whether the code handles the TCP BTL differently from the MX BTL/MTL.
>
> MX is thread safe with the caveat that two threads may not try to
> complete the same request at the same time. This includes calling
> mx_test(), mx_wait(), mx_test_any() and/or mx_wait_any() where the
> latter two have match bits and match mask that could complete a
> request being tested/waited by another thread.
>
> Scott
>
> On Jun 11, 2009, at 6:00 AM, François Trahay wrote:
>
>> Well, according to George Bosilca
>> (http://www.open-mpi.org/community/lists/users/2005/02/0005.php),
>> threads are supported in OpenMPI.
>> The program I am trying to run works with the TCP stack, and the MX
>> driver is thread-safe, so I guess the problem comes from the MX BTL
>> or MTL.
>>
>> Francois
>>
>>
>> Scott Atchley wrote:
>>> Hi Francois,
>>>
>>> I am not familiar with the internals of the OMPI code. Are you
>>> sure, however, that threads are fully supported yet? I was under
>>> the impression that thread support was still partial.
>>>
>>> Can anyone else comment?
>>>
>>> Scott
>>>
>>> On Jun 8, 2009, at 8:43 AM, François Trahay wrote:
>>>
>>>> Hi,
>>>> I'm encountering some issues when running a multithreaded program
>>>> with OpenMPI (trunk rev. 21380, configured with
>>>> --enable-mpi-threads).
>>>> My program (included in the tar.bz2) uses several pthreads that
>>>> perform ping-pongs concurrently (thread #1 uses tag #1, thread #2
>>>> uses tag #2, etc.).
>>>> This program crashes over MX (either BTL or MTL) with the following
>>>> backtrace:
>>>>
>>>> concurrent_ping_v2: pml_cm_recvreq.c:53:
>>>> mca_pml_cm_recv_request_completion: Assertion `0 ==
>>>> ((mca_pml_cm_thin_recv_request_t*)base_request)->req_base.req_pml_complete'
>>>> failed.
>>>> [joe0:01709] *** Process received signal ***
>>>> [joe0:01709] *** Process received signal ***
>>>> [joe0:01709] Signal: Segmentation fault (11)
>>>> [joe0:01709] Signal code: Address not mapped (1)
>>>> [joe0:01709] Failing at address: 0x1238949c4
>>>> [joe0:01709] Signal: Aborted (6)
>>>> [joe0:01709] Signal code: (-6)
>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>> [joe0:01709] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7f5722cba065]
>>>> [joe0:01709] [ 2] /lib/libc.so.6(abort+0x183) [0x7f5722cbd153]
>>>> [joe0:01709] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7f5722cb3159]
>>>> [joe0:01709] [ 0] /lib/libpthread.so.0 [0x7f57240be7b0]
>>>> [joe0:01709] [ 1] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238d0a08]
>>>> [joe0:01709] [ 2] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0 [0x7f57238cf8cc]
>>>> [joe0:01709] [ 3] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_free+0x4e) [0x7f57238bdc69]
>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b72f]
>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081145a]
>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>> [joe0:01709] [10] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>> [joe0:01709] [11] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>> [joe0:01709] [12] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>> [joe0:01709] *** End of error message ***
>>>> [joe0:01709] [ 4] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208120bb]
>>>> [joe0:01709] [ 5] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_mtl_mx.so [0x7f572060b80a]
>>>> [joe0:01709] [ 6] /home/ftrahay/sources/openmpi/trunk/install//lib/libopen-pal.so.0(opal_progress+0xbc) [0x7f57238948e0]
>>>> [joe0:01709] [ 7] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f572081147a]
>>>> [joe0:01709] [ 8] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208113b7]
>>>> [joe0:01709] [ 9] /home/ftrahay/sources/openmpi/trunk/install/lib/openmpi/mca_pml_cm.so [0x7f57208112e7]
>>>> [joe0:01709] [10] /home/ftrahay/sources/openmpi/trunk/install//lib/libmpi.so.0(MPI_Recv+0x2bc) [0x7f5723e07690]
>>>> [joe0:01709] [11] ./concurrent_ping_v2(client+0x123) [0x401404]
>>>> [joe0:01709] [12] /lib/libpthread.so.0 [0x7f57240b6faa]
>>>> [joe0:01709] [13] /lib/libc.so.6(clone+0x6d) [0x7f5722d5629d]
>>>> [joe0:01709] *** End of error message ***
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 1 with PID 1709 on node joe0
>>>> exited on signal 6 (Aborted).
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>> Any ideas?
>>>>
>>>> Francois Trahay
>>>>
>>>> <bug-report.tar.bz2>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users