Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2006-10-04 19:42:16


The problem was found and fixed. Until the patch gets applied to the
1.1 and 1.2 branches, please use the attached patch.

   Thanks for your help in discovering and fixing this bug,
     george.


On Oct 4, 2006, at 5:32 PM, George Bosilca wrote:

> That's just amazing. We pass all the trapezoidal tests but we fail
> the general ones (rectangular matrix) if the leading dimension of the
> matrix on the destination processor is greater than the leading
> dimension on the sender. At least now I have narrowed down the place
> where the error occurs ...
>
> george.
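
The failing case described above can be pictured with a plain MPI program:
a column-major M x N submatrix is usually described by a strided datatype
whose stride is the local leading dimension, so the sender and the receiver
may legitimately use different strides for the same M x N elements. A
minimal sketch of that situation (the sizes, the LDA values and the use of
MPI_Type_vector are illustrative assumptions, not the actual BLACS
internals):

    /* Sketch: send an M x N column-major submatrix between two ranks
     * that use different leading dimensions (LDA_DST > LDA_SRC). */
    #include <mpi.h>
    #include <stdio.h>

    #define M       4   /* rows transferred                      */
    #define N       3   /* columns transferred                   */
    #define LDA_SRC 5   /* leading dimension on the sender       */
    #define LDA_DST 8   /* larger leading dimension on receiver  */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double a[LDA_SRC * N];
            for (int i = 0; i < LDA_SRC * N; i++) a[i] = (double)i;

            MPI_Datatype sub;
            /* N columns of M contiguous doubles, LDA_SRC apart */
            MPI_Type_vector(N, M, LDA_SRC, MPI_DOUBLE, &sub);
            MPI_Type_commit(&sub);
            MPI_Send(a, 1, sub, 1, 0, MPI_COMM_WORLD);
            MPI_Type_free(&sub);
        } else if (rank == 1) {
            double b[LDA_DST * N];
            MPI_Datatype sub;
            /* same M x N shape, but strided by the larger LDA_DST */
            MPI_Type_vector(N, M, LDA_DST, MPI_DOUBLE, &sub);
            MPI_Type_commit(&sub);
            MPI_Recv(b, 1, sub, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Type_free(&sub);
            printf("received %d x %d submatrix\n", M, N);
        }

        MPI_Finalize();
        return 0;
    }
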
>
> On Oct 4, 2006, at 4:41 PM, George Bosilca wrote:
>
>> OK, that was my 5 minutes in the hall of shame. Setting the verbosity
>> level in bt.dat to 6 gives me enough information to know exactly the
>> data-type shape. Now I know how to fix things ...
>>
>> george.
>>
>> On Oct 4, 2006, at 4:35 PM, George Bosilca wrote:
>>
>>> I'm working on this bug. As far as I can see, the patch from bug 365
>>> does not help us here. However, on my 64-bit machine (not an Opteron
>>> but a G5) I don't get the segfault. Anyway, I do get the bad data
>>> transmission for tests #1 and #51. So far my main problem is that I
>>> cannot reproduce these errors with any other data-type tests [and
>>> believe me, we have a bunch of them]. The only one that fails is the
>>> BLACS. I wonder what the data-type looks like for the failing tests.
>>> Does someone here know how to extract the BLACS data-type (for tests
>>> #1 and #51)? Or how to force BLACS to print the data-type information
>>> for each test (M, N, and so on)?
>>>
>>> Thanks,
>>> george.
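
One way to get the shape information asked for above is to log the
arguments of the C BLACS point-to-point routines, since they carry exactly
M, N and the leading dimension. A rough sketch, assuming the tester can be
pointed at a tracing wrapper (the wrapper name is hypothetical; Cdgesd2d is
the standard C BLACS general rectangular send):

    #include <stdio.h>

    /* prototype of the real C BLACS routine */
    void Cdgesd2d(int ConTxt, int m, int n, double *A, int lda,
                  int rdest, int cdest);

    /* hypothetical tracing wrapper: print the submatrix shape, then
     * forward to the real send */
    void traced_Cdgesd2d(int ConTxt, int m, int n, double *A, int lda,
                         int rdest, int cdest)
    {
        fprintf(stderr, "dgesd2d: M=%d N=%d LDA=%d -> (%d,%d)\n",
                m, n, lda, rdest, cdest);
        Cdgesd2d(ConTxt, m, n, A, lda, rdest, cdest);
    }
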
>>>
>>> On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:
>>>
>>>> On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:
>>>>
>>>>> The TRANSCOMM setting that we are using here, and that I think is
>>>>> the correct one, is "-DUseMpi2", since Open MPI implements the
>>>>> corresponding MPI-2 calls. You need a recent version of BLACS for
>>>>> this setting to be available (1.1 with patch 3 should be fine).
>>>>> Together with the patch to Open MPI 1.1.1 from ticket 356 we are
>>>>> passing the BLACS tester for 4 processors. I didn't have time to
>>>>> test with other numbers though.
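
The MPI-2 calls behind "-DUseMpi2" are presumably the standard handle
conversions MPI_Comm_f2c and MPI_Comm_c2f, which let BLACS translate
between the Fortran and C views of a communicator instead of assuming a
vendor-specific handle layout. A minimal sketch of that round trip:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* C -> Fortran: the integer handle a Fortran caller would see */
        MPI_Fint fcomm = MPI_Comm_c2f(MPI_COMM_WORLD);

        /* Fortran -> C: recover a C handle usable by C BLACS internals */
        MPI_Comm ccomm = MPI_Comm_f2c(fcomm);

        int size;
        MPI_Comm_size(ccomm, &size);

        MPI_Finalize();
        return 0;
    }
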
>>>>
>>>> Unfortunately this did not solve the problems I'm seeing; it could
>>>> be because my system is 64-bit (another person is seeing problems on
>>>> an Opteron system).
>>>>
>>>> New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, 1.3a1r11962)
>>>> with Intel ifort 9.0.32 and g95 (Sep 27 2006).
>>>>
>>>> System: Debian Linux 3.1r3 on dual Opteron, gcc 3.3.5; all tests
>>>> with 4 processors.
>>>>
>>>> 1) Patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from
>>>> Ticket 356.
>>>> 2) Set TRANSCOMM = -DUseMpi2.
>>>>
>>>> Intel ifort 9.0.32 tests (INTFACE=-DAdd):
>>>>
>>>> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
>>>> In xCbtest both generated errors until the Integer Sum tests, then
>>>> no more errors.
>>>>
>>>> OpenMPI 1.3a1r11962: no errors until crash:
>>>>
>>>> COMPLEX AMX TESTS: BEGIN.
>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>> Failing at addr:0xe62000
>>>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>>>> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
>>>> *** End of error message ***
>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>> Failing at addr:0xbc0000
>>>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>>>> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
>>>> *** End of error message ***
>>>>
>>>> g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):
>>>>
>>>> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
>>>> In xCbtest both generated errors until the Integer Sum tests, then
>>>> no more errors.
>>>>
>>>> OpenMPI 1.3a1r11962: no errors until crash:
>>>>
>>>> COMPLEX SUM TESTS: BEGIN.
>>>> COMPLEX SUM TESTS: 1152 TESTS; 864 PASSED, 288 SKIPPED, 0
>>>> FAILED.
>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>> Failing at addr:0xb6f000
>>>> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
>>>> +0x1f) [0x2a95aa7c1f]
>>>> *** End of error message ***
>>>>
>>>> COMPLEX AMX TESTS: BEGIN.
>>>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>>>> Failing at addr:0xe27000
>>>> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
>>>> +0x1f) [0x2a95aa7c1f]
>>>> *** End of error message ***
>>>> 3 additional processes aborted (not shown)
>>>>
>>>>
>>>> Michael
>>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users


  • application/octet-stream attachment: ddt.patch