Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-10-04 16:41:47


OK, that was my five minutes in the hall of shame. Setting the verbosity
level in bt.dat to 6 gives me enough information to know exactly what the
data-type looks like. Now I know how to fix things ...
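
For what it's worth, a generic way to dump how a derived data-type was
built is the standard MPI-2 introspection calls; here is a rough sketch
(the vector type and its dimensions below are made up for illustration,
they are not the type the BLACS tester trips on):

    /* Sketch: query how a derived MPI datatype was constructed. */
    #include <mpi.h>
    #include <stdio.h>

    static void describe_type(MPI_Datatype dtype)
    {
        int ni, na, nd, combiner;
        MPI_Type_get_envelope(dtype, &ni, &na, &nd, &combiner);
        printf("combiner=%d: %d ints, %d addresses, %d datatypes\n",
               combiner, ni, na, nd);
        /* MPI_Type_get_contents() would then return the actual counts,
           block lengths and strides of a non-predefined type. */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Illustrative derived type: a 4x3 block of doubles taken out
           of a column-major array with leading dimension 10. */
        MPI_Datatype sub;
        MPI_Type_vector(3, 4, 10, MPI_DOUBLE, &sub);
        MPI_Type_commit(&sub);

        describe_type(sub);

        MPI_Type_free(&sub);
        MPI_Finalize();
        return 0;
    }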

   george.

On Oct 4, 2006, at 4:35 PM, George Bosilca wrote:

> I'm working on this bug. As far as I can see, the patch from bug 365
> does not help us here. However, on my 64-bit machines (not Opteron but
> G5) I don't get the segfault. I still get the bad data transmission
> for tests #1 and #51, though. So far my main problem is that I cannot
> reproduce these errors with any other data-type tests [and believe me,
> we have a bunch of them]. The only one that fails is the BLACS tester.
> I wonder what the data-type looks like for the failing tests. Does
> anyone here know how to extract the BLACS data-type (for tests #1 and
> #51), or how to force BLACS to print the data-type information for each
> test (M, N and so on)?
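>
> Just for illustration, a general M-by-N (column-major) submatrix with
> leading dimension LDA is usually described to MPI as a vector type; a
> minimal sketch with made-up names, not lifted from the BLACS source:
>
>     /* Sketch: datatype covering an M x N column-major submatrix
>        stored with leading dimension LDA (names are illustrative). */
>     #include <mpi.h>
>
>     MPI_Datatype submatrix_type(int m, int n, int lda)
>     {
>         MPI_Datatype t;
>         /* n columns of m doubles each, columns lda elements apart */
>         MPI_Type_vector(n, m, lda, MPI_DOUBLE, &t);
>         MPI_Type_commit(&t);
>         return t;
>     }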
>
> Thanks,
> george.
>
> On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:
>
>> On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:
>>
>>> The TRANSCOMM setting that we are using here, and that I think is
>>> the correct one, is "-DUseMpi2", since OpenMPI implements the
>>> corresponding MPI-2 calls. You need a recent version of BLACS for
>>> this setting to be available (1.1 with patch 3 should be fine).
>>> Together with the patch to OpenMPI 1.1.1 from ticket 356 we are
>>> passing the BLACS tester for 4 processors. I didn't have time to
>>> test with other numbers, though.
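>>>
>>> Assuming the usual meaning of that flag, the -DUseMpi2 path relies on
>>> the MPI-2 handle-conversion routines that OpenMPI provides; a minimal
>>> sketch (the wrapper name is illustrative, not taken from the BLACS
>>> source):
>>>
>>>     #include <mpi.h>
>>>
>>>     /* Convert a Fortran communicator handle to its C counterpart
>>>        with the portable MPI-2 routine; MPI_Comm_c2f() goes the
>>>        other way. */
>>>     MPI_Comm comm_from_fortran(MPI_Fint f_comm)
>>>     {
>>>         return MPI_Comm_f2c(f_comm);
>>>     }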
>>
>> Unfortunately this did not solve the problems I'm seeing; it could be
>> that my system is 64-bit (another person is seeing problems on an
>> Opteron system).
>>
>> New tests of BLACS 1.1p3 vs. OpenMPI (1.1.1, 1.1.2rc1, 1.3a1r11962)
>> with Intel ifort 9.0.32 and g95 (Sep 27 2006).
>>
>> System: Debian Linux 3.1r3 on a dual-Opteron machine, gcc 3.3.5; all
>> tests run with 4 processors
>>
>> 1) patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from Ticket
>> 356.
>> 2) set TRANSCOMM = -DUseMpi2 (launch command sketched below)
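>>
>> For reference, each tester run below was launched roughly like this
>> (the exact binary name and path depend on the BLACS build; only the
>> 4-process launch matters here):
>>
>>     mpirun -np 4 ./xCbtest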
>>
>> Intel ifort 9.0.32 tests (INTFACE=-DAdd):
>>
>> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
>> In the xCbtest run, both generated errors up to the Integer Sum
>> tests, then no more errors.
>>
>> OpenMPI 1.3a1r11962: no errors until it crashed:
>>
>> COMPLEX AMX TESTS: BEGIN.
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0xe62000
>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
>> *** End of error message ***
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0xbc0000
>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
>> *** End of error message ***
>>
>> g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):
>>
>> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
>> In the xCbtest run, both generated errors up to the Integer Sum
>> tests, then no more errors.
>>
>> OpenMPI 1.3a1r11962: no errors until it crashed:
>>
>> COMPLEX SUM TESTS: BEGIN.
>> COMPLEX SUM TESTS: 1152 TESTS; 864 PASSED, 288 SKIPPED, 0
>> FAILED.
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0xb6f000
>> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
>> +0x1f) [0x2a95aa7c1f]
>> *** End of error message ***
>>
>> COMPLEX AMX TESTS: BEGIN.
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0xe27000
>> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
>> +0x1f) [0x2a95aa7c1f]
>> *** End of error message ***
>> 3 additional processes aborted (not shown)
>>
>>
>> Michael