
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] regression with derived datatypes
From: Rolf vandeVaart (rvandevaart_at_[hidden])
Date: 2014-05-30 08:58:16

This fixed all of my issues. Thanks. I will add that comment to ticket also.

>-----Original Message-----
>From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of George
>Sent: Thursday, May 29, 2014 5:58 PM
>To: Open MPI Developers
>Subject: Re: [OMPI devel] regression with derived datatypes
>r31904 should fix this issue. Please test it thoroughly and report all issues.
> George.
>On Fri, May 9, 2014 at 6:56 AM, Gilles Gouaillardet
><gilles.gouaillardet_at_[hidden]> wrote:
>> i opened #4610
>> and attached a patch for the v1.8 branch
>> i ran several tests from the intel_tests test suite and did not
>> observe any regression.
>> please note there are still issues when running with --mca btl
>> scif,vader,self
>> this might be another issue, i will investigate more next week
>> Gilles
>> On 2014/05/09 18:08, Gilles Gouaillardet wrote:
>>> I ran some more investigations with --mca btl scif,self
>>> i found that the previous patch i posted was complete crap and i
>>> apologize for it.
>>> on a brighter side, and imho, the issue only occurs if fragments are
>>> received (and then processed) out of order.
>>> /* i did not observe this with the tcp btl, but i always see that
>>> with the scif btl, i guess this can be observed too with openib+RDMA
>>> */
>>> in this case only, opal_convertor_generic_simple_position(...) is
>>> invoked and does not set the pConvertor->pStack as expected by r31496
>>> i will run some more tests now
>>> Gilles
>>> On 2014/05/08 2:23, George Bosilca wrote:
>>>> Strange. The outcome and the timing of this issue seem to highlight a link
>with the other datatype-related issue you reported earlier and, as suggested
>by Ralph, with Gilles's scif+vader issue.
>>>> Generally speaking, the mechanism used to split the data in the case of
>multiple BTLs is identical to the one used to split the data into fragments. So, if
>the culprit is in the splitting logic, one might see some weirdness as soon as
>we force the exclusive usage of the send protocol with an unconventional
>fragment size.
>>>> In other words, using the following flags "--mca btl tcp,self --mca
>btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit
>23 --mca btl_tcp_max_send_size 23" should always transfer wrong data,
>even when only one single BTL is in play.
>>>> George.
>>>> On May 7, 2014, at 13:11, Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>>>>> OK. So, I investigated a little more. I only see the issue when I am
>running with multiple ports enabled such that I have two openib BTLs
>instantiated. In addition, large message RDMA has to be enabled. If those
>conditions are not met, then I do not see the problem. For example:
>>>>> FAILS:
>>>>> mpirun --np 2 --host host1,host2 --mca btl_openib_if_include
>>>>> mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 3 MPI_Isend_ator_c
>>>>> PASS:
>>>>> mpirun --np 2 --host host1,host2 --mca btl_openib_if_include
>>>>> mlx5_0:1 --mca btl_openib_flags 3 MPI_Isend_ator_c
>>>>> mpirun --np 2 --host host1,host2 --mca btl_openib_if_include
>>>>> mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 1 MPI_Isend_ator_c
>>>>> So we must have some type of issue when we break up the message
>between the two openib BTLs. Maybe someone else can confirm my observations.
>>>>> I was testing against the latest trunk.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
