
Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2006-10-04 16:35:42


I'm working on this bug. As far as I can see, the patch from bug 365
does not help us here. However, on my 64-bit machines (not Opteron,
but G5) I don't get the segfault; I do get the bad data transmission
for tests #1 and #51. So far my main problem is that I cannot
reproduce these errors with any other data-type tests [and believe
me, we have a bunch of them]. The only one that fails is BLACS. I
wonder what the data-type looks like for the failing tests. Does
anyone here know how to extract the BLACS data-type (for tests #1
and #51)? Or how to force BLACS to print the data-type information
for each test (M, N and so on)?
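
In case it helps anyone poke at this, the sketch below is roughly
what I mean by "extracting the data-type": it walks an MPI derived
datatype with the MPI-2 introspection calls and prints the
constructor arguments (counts, block lengths, strides). Where to
hook it into the BLACS tester is left open, and the name
dump_datatype is just for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Print the combiner and the integer/address arguments of a derived
   datatype. Derived datatypes returned by MPI_Type_get_contents
   would normally be recursed into and freed; this sketch skips
   that. */
static void dump_datatype(MPI_Datatype type)
{
    int ni, na, nd, combiner, i;
    int *ints;
    MPI_Aint *addrs;
    MPI_Datatype *types;

    MPI_Type_get_envelope(type, &ni, &na, &nd, &combiner);
    if (combiner == MPI_COMBINER_NAMED) {
        printf("predefined datatype\n");
        return;
    }

    ints  = malloc(ni * sizeof(int));
    addrs = malloc(na * sizeof(MPI_Aint));
    types = malloc(nd * sizeof(MPI_Datatype));

    MPI_Type_get_contents(type, ni, na, nd, ints, addrs, types);

    printf("combiner=%d  ints:", combiner);
    for (i = 0; i < ni; i++) printf(" %d", ints[i]);
    printf("  addrs:");
    for (i = 0; i < na; i++) printf(" %ld", (long) addrs[i]);
    printf("\n");

    free(ints); free(addrs); free(types);
}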

   Thanks,
     george.

On Oct 4, 2006, at 4:13 PM, Michael Kluskens wrote:

> On Oct 4, 2006, at 8:22 AM, Harald Forbert wrote:
>
>> The TRANSCOMM setting that we are using here, and that I think is
>> the correct one, is "-DUseMpi2", since OpenMPI implements the
>> corresponding MPI-2 calls. You need a recent version of BLACS for
>> this setting to be available (1.1 with patch 3 should be fine).
>> Together with the patch to OpenMPI 1.1.1 from ticket 356, we pass
>> the BLACS tester with 4 processors. I didn't have time to test
>> with other process counts, though.
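
[A side note for anyone following along: as far as I understand it,
-DUseMpi2 makes the BLACS C interface obtain its C communicator via
the standard MPI-2 handle-conversion calls instead of relying on
implementation internals, roughly as in the sketch below. The
wrapper names are illustrative, not the actual BLACS symbols.

#include <mpi.h>

/* Translate a Fortran communicator handle (an INTEGER on the
   Fortran side) to a C MPI_Comm and back, using the MPI-2
   conversion functions that -DUseMpi2 selects. */
MPI_Comm fcomm_to_c(MPI_Fint fcomm)
{
    return MPI_Comm_f2c(fcomm);
}

MPI_Fint ccomm_to_f(MPI_Comm ccomm)
{
    return MPI_Comm_c2f(ccomm);
}

This is also consistent with needing a recent BLACS: the UseMpi2
code path is only present in newer versions.]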
>
> Unfortunately this did not solve the problems I'm seeing; it could
> be that my system is 64-bit (another person is seeing problems on
> an Opteron system).
>
> New tests of BLACS 1.1p3 against OpenMPI 1.1.1, 1.1.2rc1, and
> 1.3a1r11962, with Intel ifort 9.0.32 and g95 (Sep 27 2006).
>
> System: Debian Linux 3.1r3 on a dual-Opteron machine, gcc 3.3.5;
> all tests run with 4 processors.
>
> 1) patched OpenMPI 1.1.1 and 1.1.2rc1 using the two lines from Ticket
> 356.
> 2) set TRANSCOMM = -DUseMpi2
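
[Another aside, on the INTFACE values quoted below: INTFACE only
controls how the C side of BLACS spells the Fortran-callable
symbols, which is why it differs between the two compilers. A
hypothetical sketch of the idea (the macro and routine names are
illustrative, not copied from the BLACS headers): ifort wants a
single appended underscore, while g95 follows the f2c rule where a
name that already contains an underscore gets a second trailing one.

/* Hypothetical illustration of the name mangling behind INTFACE. */
#if defined(Add)                          /* INTFACE = -DAdd       */
#  define BLACS_PINFO_F77  blacs_pinfo_
#elif defined(f77IsF2C)                   /* INTFACE = -Df77IsF2C  */
#  define BLACS_PINFO_F77  blacs_pinfo__
#endif

/* The Fortran-callable entry point, whatever it ends up named. */
void BLACS_PINFO_F77(int *mypnum, int *nprocs);

If the spelling does not match what the Fortran compiler emits, the
link simply fails, so INTFACE has to match the compiler just as
TRANSCOMM has to match the MPI.]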
>
> Intel ifort 9.0.32 tests (INTFACE=-DAdd):
>
> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
> In xCbtest, both generated errors up to the Integer Sum tests,
> then no further errors.
>
> OpenMPI 1.3a1r11962: no errors until crash:
>
> COMPLEX AMX TESTS: BEGIN.
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0xe62000
> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
> *** End of error message ***
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0xbc0000
> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
> (opal_backtrace_print+0x1f) [0x2a95aa8c1f]
> *** End of error message ***
>
> g95 (Sep 27 2006) tests (INTFACE=-Df77IsF2C):
>
> OpenMPI 1.1.1 (patched) & OpenMPI 1.1.2rc1 (patched):
> In xCbtest, both generated errors up to the Integer Sum tests,
> then no further errors.
>
> OpenMPI 1.3a1r11962: no errors until crash:
>
> COMPLEX SUM TESTS: BEGIN.
> COMPLEX SUM TESTS: 1152 TESTS; 864 PASSED, 288 SKIPPED, 0 FAILED.
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0xb6f000
> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
> +0x1f) [0x2a95aa7c1f]
> *** End of error message ***
>
> COMPLEX AMX TESTS: BEGIN.
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0xe27000
> [0] func:/opt/g95/openmpi/1.3/lib/libopal.so.0(opal_backtrace_print
> +0x1f) [0x2a95aa7c1f]
> *** End of error message ***
> 3 additional processes aborted (not shown)
>
>
> Michael