Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-10-04 19:51:50


This is the correct patch (same as previous minus the debugging
statements).

   Thanks,
     george.


On Oct 4, 2006, at 7:42 PM, George Bosilca wrote:

> The problem was found and fixed. Until the patch gets applied to the
> 1.1 and 1.2 branches, please use the attached patch.
>
> Thanks for your help in discovering and fixing this bug,
> george.
>
> <ddt.patch>
>
> On Oct 4, 2006, at 5:32 PM, George Bosilca wrote:
>
>> That's just amazing. We pass all the trapezoidal tests but fail the
>> general ones (rectangular matrices) if the leading dimension of the
>> matrix on the destination processor is greater than the leading
>> dimension on the sender. At least now I have narrowed down the place
>> where the error occurs ...
>>
>> george.
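
For readers trying to reproduce the failing shape outside of BLACS, here
is a minimal sketch (illustrative only, with made-up sizes; this is not
the BLACS or Open MPI code) of sending an M x N column-major submatrix
between buffers whose leading dimensions differ:

    /* Minimal sketch (not the BLACS or Open MPI code): send an M x N
     * column-major submatrix between local arrays whose leading
     * dimensions differ, the shape reported to trigger the failure
     * (receiver's leading dimension larger than the sender's). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        int M = 4, N = 3;                 /* submatrix size (made up)     */
        int lda_send = 5, lda_recv = 8;   /* leading dimensions (made up) */
        double sbuf[8 * 3] = {0}, rbuf[8 * 3] = {0};
        MPI_Datatype sub_send, sub_recv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) { MPI_Finalize(); return 0; }

        /* N columns of M doubles each, strided by the local leading dim. */
        MPI_Type_vector(N, M, lda_send, MPI_DOUBLE, &sub_send);
        MPI_Type_commit(&sub_send);
        MPI_Type_vector(N, M, lda_recv, MPI_DOUBLE, &sub_recv);
        MPI_Type_commit(&sub_recv);

        if (rank == 0) {
            MPI_Send(sbuf, 1, sub_send, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(rbuf, 1, sub_recv, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %dx%d submatrix into buffer with lda %d\n",
                   M, N, lda_recv);
        }

        MPI_Type_free(&sub_send);
        MPI_Type_free(&sub_recv);
        MPI_Finalize();
        return 0;
    }

Run with at least two processes; the point is only that the send-side and
receive-side datatypes describe the same M x N block but with different
strides (leading dimensions).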
>>
>> On Oct 4, 2006, at 4:41 PM, George Bosilca wrote:
>>
>>> OK, that was my five minutes in the hall of shame. Setting the
>>> verbosity level in bt.dat to 6 gives me enough information to know
>>> exactly the data-type shape. Now I know how to fix things ...
>>>
>>> george.
>>>
>>> On Oct 4, 2006, at 4:35 PM, George Bosilca wrote:
>>>
>>>> I'm working on this bug. As far as I can see, the patch from bug 365
>>>> does not help us here. However, on my 64-bit machines (not Opteron
>>>> but G5) I don't get the segfault. Anyway, I do get the bad data
>>>> transmission for tests #1 and #51. So far my main problem is that I
>>>> cannot reproduce these errors with any other data-type tests [and
>>>> believe me, we have a bunch of them]. The only one that fails is the
>>>> BLACS. I wonder what the data-type looks like for the failing tests.
>>>> Does anyone here know how to extract the BLACS data-type (for tests
>>>> #1 and #51)? Or how to force BLACS to print the data-type
>>>> information for each test (M, N, and so on)?
>>>>
>>>> Thanks,
>>>> george.
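
One generic way to inspect an unknown derived datatype, independent of
BLACS, is the MPI-2 introspection pair MPI_Type_get_envelope /
MPI_Type_get_contents. A rough, self-contained sketch (the vector type
built in main is only a stand-in for whatever datatype the BLACS tester
actually constructs):

    /* Rough sketch of MPI-2 datatype introspection; the vector type built
     * in main is only a stand-in for the type used by the failing tests. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void describe_type(MPI_Datatype dtype)
    {
        int ni, na, nd, combiner;
        MPI_Type_get_envelope(dtype, &ni, &na, &nd, &combiner);

        if (combiner == MPI_COMBINER_NAMED) {
            printf("predefined datatype\n");
            return;
        }

        int          *ints  = malloc((ni ? ni : 1) * sizeof(int));
        MPI_Aint     *addrs = malloc((na ? na : 1) * sizeof(MPI_Aint));
        MPI_Datatype *types = malloc((nd ? nd : 1) * sizeof(MPI_Datatype));

        MPI_Type_get_contents(dtype, ni, na, nd, ints, addrs, types);

        printf("combiner=%d, integer arguments:", combiner);
        for (int i = 0; i < ni; i++)
            printf(" %d", ints[i]);   /* counts, blocklengths, strides ... */
        printf("\n");

        free(ints); free(addrs); free(types);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Datatype vec;
        MPI_Type_vector(3, 4, 8, MPI_DOUBLE, &vec);  /* stand-in type */
        MPI_Type_commit(&vec);

        describe_type(vec);

        MPI_Type_free(&vec);
        MPI_Finalize();
        return 0;
    }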
>>>>
>>>> On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:
>>>>
>>>>> On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:
>>>>>
>>>>>> The TRANSCOMM setting that we are using here, and that I think is
>>>>>> the correct one, is "-DUseMpi2", since OpenMPI implements the
>>>>>> corresponding MPI-2 calls. You need a recent version of BLACS for
>>>>>> this setting to be available (1.1 with patch 3 should be fine).
>>>>>> Together with the patch to OpenMPI 1.1.1 from ticket 356, we are
>>>>>> passing the BLACS tester for 4 processors. I didn't have time to
>>>>>> test with other process counts, though.
>>>>>
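
For anyone setting up BLACS against Open MPI: as far as I understand it,
-DUseMpi2 tells BLACS to translate between Fortran and C communicators
with the MPI-2 handle-conversion functions instead of relying on
implementation internals. A small illustrative sketch of those calls (not
the BLACS code itself):

    /* Illustrative only: the MPI-2 handle-conversion calls that (as far
     * as I understand it) the -DUseMpi2 TRANSCOMM setting makes BLACS use. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* C -> Fortran: the integer handle a Fortran caller would hold. */
        MPI_Fint fcomm = MPI_Comm_c2f(MPI_COMM_WORLD);

        /* Fortran -> C: recover a usable C communicator from that handle. */
        MPI_Comm ccomm = MPI_Comm_f2c(fcomm);

        int rank;
        MPI_Comm_rank(ccomm, &rank);
        printf("rank %d: MPI_COMM_WORLD round-tripped through MPI_Fint %d\n",
               rank, (int)fcomm);

        MPI_Finalize();
        return 0;
    }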
>>>>> Unfortunately this did not solve the problems I'm seeing; it could
>>>>> be because my system is 64-bit (another person is seeing problems
>>>>> on an Opteron system as well).
>>>>>
>>>>> New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, and
>>>>> 1.3a1r11962) with Intel ifort 9.0.32 and g95 (Sep 27 2006).
>>>>>
>>>>> System: Debian Linux 3.1r3 on a dual Opteron, gcc 3.3.5; all tests
>>>>> run with 4 processors.
>>>>>
>>>>> 1) Patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from
>>>>> Ticket 356.
>>>>> 2) Set TRANSCOMM = -DUseMpi2.
>>>>>
>>>>> Intel ifort 9.0.32 tests (INTFACE=-DAdd):
>>>>>
>>>>> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
>>>>> In xCbtest, both generated errors up to the Integer Sum tests, and
>>>>> then no more errors.
>>>>>
>>>>> OpenMPI 1.3a1r11962: no errors until crash:
>>>>>
>>>>> COMPLEX AMX TESTS: BEGIN.
>>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>>> Failing at addr:0xe62000
>>>>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>>>>> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
>>>>> *** End of error message ***
>>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>>> Failing at addr:0xbc0000
>>>>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>>>>> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
>>>>> *** End of error message ***
>>>>>
>>>>> g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):
>>>>>
>>>>> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
>>>>> In xCbtest, both generated errors up to the Integer Sum tests, and
>>>>> then no more errors.
>>>>>
>>>>> OpenMPI 1.3a1r11962: no errors until crash:
>>>>>
>>>>> COMPLEX SUM TESTS: BEGIN.
>>>>> COMPLEX SUM TESTS: 1152 TESTS; 864 PASSED, 288 SKIPPED, 0 FAILED.
>>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>>> Failing at addr:0xb6f000
>>>>> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0
>>>>> (opal_backtrace_print
>>>>> +0x1f) [0x2a95aa7c1f]
>>>>> *** End of error message ***
>>>>>
>>>>> COMPLEX AMX TESTS: BEGIN.
>>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>>> Failing at addr:0xe27000
>>>>> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0
>>>>> (opal_backtrace_print
>>>>> +0x1f) [0x2a95aa7c1f]
>>>>> *** End of error message ***
>>>>> 3 additional processes aborted (not shown)
>>>>>
>>>>>
>>>>> Michael
>>>>>


  • application/octet-stream attachment: ddt.patch