r31904 should fix this issue. Please test it thoroughly and report any issues.
On Fri, May 9, 2014 at 6:56 AM, Gilles Gouaillardet
> i opened #4610 https://svn.open-mpi.org/trac/ompi/ticket/4610
> and attached a patch for the v1.8 branch
> i ran several tests from the intel_tests test suite and did not observe
> any regression.
> please note there are still issues when running with --mca btl
> this might be another issue, i will investigate more next week
> On 2014/05/09 18:08, Gilles Gouaillardet wrote:
>> I ran some more investigations with --mca btl scif,self
>> i found that the previous patch i posted was complete crap and i
>> apologize for it.
>> on a brighter side, and imho, the issue only occurs if fragments are
>> received (and then processed) out of order.
>> /* i did not observe this with the tcp btl, but i always see that with
>> the scif btl, i guess this can be observed too
>> with openib+RDMA */
>> in this case only, opal_convertor_generic_simple_position(...) is
>> invoked and does not set the pConvertor->pStack
>> as expected by r31496
>> i will run some more tests from now
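The out-of-order case described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Open MPI's convertor: the strided layout (BLOCK, STRIDE), the ToyUnpacker class, and its set_position method are hypothetical names invented for the illustration. The point it shows is that an unpacker for a non-contiguous layout carries a cursor, and that cursor has to be recomputed from the packed offset whenever a fragment does not directly follow the previous one.

```python
import random

BLOCK = 8    # payload bytes per block (hypothetical layout)
STRIDE = 12  # distance in bytes between block starts in user memory


class ToyUnpacker:
    """Unpacks a packed byte stream into a strided user buffer."""

    def __init__(self, n_blocks):
        self.user = bytearray(n_blocks * STRIDE)
        self.block = 0   # cursor: index of the current block
        self.within = 0  # cursor: bytes already filled in that block

    def set_position(self, packed_offset):
        # Recompute the whole cursor from the packed offset. Skipping this
        # for an out-of-order fragment would leave the cursor stale.
        self.block, self.within = divmod(packed_offset, BLOCK)

    def unpack(self, data):
        for byte in data:
            self.user[self.block * STRIDE + self.within] = byte
            self.within += 1
            if self.within == BLOCK:
                self.block, self.within = self.block + 1, 0


def deliver(fragments, n_blocks):
    """Feed (packed_offset, data) fragments to a fresh unpacker."""
    u = ToyUnpacker(n_blocks)
    for off, data in fragments:
        u.set_position(off)
        u.unpack(data)
    return bytes(u.user)


n_blocks = 16
packed = bytes((i * 7) % 251 for i in range(n_blocks * BLOCK))
# 23-byte fragments, deliberately not aligned with the 8-byte blocks
frags = [(o, packed[o:o + 23]) for o in range(0, len(packed), 23)]
in_order = deliver(frags, n_blocks)
shuffled = frags[:]
random.Random(0).shuffle(shuffled)
assert deliver(shuffled, n_blocks) == in_order
```

With in-order delivery the cursor left behind by the previous fragment happens to be correct, which would be consistent with the problem staying hidden on a transport that never reorders fragments.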
>> On 2014/05/08 2:23, George Bosilca wrote:
>>> Strange. The outcome and the timing of this issue seem to highlight a link with the other datatype-related issue you reported earlier and, as suggested by Ralph, with Gilles' scif+vader issue.
>>> Generally speaking, the mechanism used to split the data across multiple BTLs is identical to the one used to split the data into fragments. So, if the culprit is in the splitting logic, one might see some weirdness as soon as we force the exclusive usage of the send protocol with an unconventional fragment size.
>>> In other words, using the following flags "--mca btl tcp,self --mca btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23 --mca btl_tcp_max_send_size 23" should always transfer wrong data, even when only one single BTL is in play.
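The argument above (one splitting routine serves both the fragment case and the multi-BTL case) can be sketched as a small model. It is an illustration only, not Open MPI code: split_fragments, send_over_channels, reassemble, and the round-robin assignment are all assumptions made for the example. The fragment size of 23 is taken from the flags quoted above because it does not divide typical message sizes evenly.

```python
def split_fragments(total_len, max_frag):
    """Yield (offset, length) pairs covering [0, total_len) in order."""
    offset = 0
    while offset < total_len:
        length = min(max_frag, total_len - offset)
        yield offset, length
        offset += length


def send_over_channels(payload, max_frag, n_channels):
    """Assign fragments round-robin to n_channels hypothetical transports."""
    channels = [[] for _ in range(n_channels)]
    for i, (off, length) in enumerate(split_fragments(len(payload), max_frag)):
        channels[i % n_channels].append((off, payload[off:off + length]))
    return channels


def reassemble(channels, total_len):
    """Place every fragment at its recorded offset, whatever the arrival order."""
    out = bytearray(total_len)
    for chan in channels:
        for off, data in chan:
            out[off:off + len(data)] = data
    return bytes(out)


payload = bytes(range(256)) * 4  # 1024 bytes, not a multiple of 23
one = send_over_channels(payload, 23, 1)
two = send_over_channels(payload, 23, 2)
assert reassemble(one, len(payload)) == payload
assert reassemble(two, len(payload)) == payload
```

Because both cases go through the same split_fragments routine in this model, a defect there would show up with a single channel as soon as the fragment size is small and odd, which is the reasoning behind the suggested single-BTL test.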
>>> On May 7, 2014, at 13:11 , Rolf vandeVaart <rvandevaart_at_[hidden]> wrote:
>>>> OK. So, I investigated a little more. I only see the issue when I am running with multiple ports enabled such that I have two openib BTLs instantiated. In addition, large message RDMA has to be enabled. If those conditions are not met, then I do not see the problem. For example:
>>>> mpirun -np 2 -host host1,host2 --mca btl_openib_if_include mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 3 MPI_Isend_ator_c
>>>> mpirun -np 2 -host host1,host2 --mca btl_openib_if_include mlx5_0:1 --mca btl_openib_flags 3 MPI_Isend_ator_c
>>>> mpirun -np 2 -host host1,host2 --mca btl_openib_if_include mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 1 MPI_Isend_ator_c
>>>> So we must have some type of issue when we break up the message between the two openib BTLs. Maybe someone else can confirm my observations?
>>>> I was testing against the latest trunk.