Open MPI User's Mailing List Archives

From: Michael Kluskens (mklus_at_[hidden])
Date: 2006-10-03 17:43:56


Additional note on the BLACS vs. OpenMPI 1.1.1 & 1.3 problems:

The BLACS install program xtc_CsameF77 says not to use -DCSameF77
with OpenMPI; because of an oversight, however, I used it in my first
tests. With OpenMPI 1.1.1 the errors are the same with and without
this setting, but without it the tester program is very slow or hangs
at "RUNNING REPEATABLE SUM TEST" near the end. OpenMPI 1.1.2rc1
behaved nearly identically.

With regard to OpenMPI 1.3, not using -DCSameF77 (that is, leaving
TRANSCOMM blank) prevents the crash I observed earlier; however,
massive errors begin at the "DOUBLE COMPLEX AMX" tests, and the
auxiliary tests at the end are then very slow or hang at "RUNNING
REPEATABLE SUM TEST".

I don't know enough about the internals of OpenMPI to follow the
discussion below, or to tell whether the install program xtc_CsameF77
works correctly with OpenMPI:

# If you know that your MPI uses the same handles for fortran and C
# communicators, you can replace the empty macro definition below with
# the macro definition on the following line.
   TRANSCOMM = -DCSameF77
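
As far as I can tell, the translation in question is the MPI-2
handle-conversion interface (MPI_Comm_f2c / MPI_Comm_c2f). For anyone
else puzzling over this, here is a minimal sketch of the portable code
path that leaving TRANSCOMM blank implies; grid_setup_ is a made-up
name for illustration, not an actual BLACS routine:

/* Portable handling of a Fortran communicator handle in C, rather
 * than assuming the two handle types are bit-identical (which is
 * what -DCSameF77 asserts). */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical C routine called from Fortran (INTFACE = -DAdd_ gives
 * the trailing underscore); Fortran passes its INTEGER handle. */
void grid_setup_(MPI_Fint *fcomm)
{
    MPI_Comm comm = MPI_Comm_f2c(*fcomm);  /* translate the handle */
    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("rank %d entered grid_setup\n", rank);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* Produce the Fortran handle a caller would pass in. */
    MPI_Fint f = MPI_Comm_c2f(MPI_COMM_WORLD);
    grid_setup_(&f);
    MPI_Finalize();
    return 0;
}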

The complete details are below:

# If you know something about your system, you may make it easier for the
# BLACS to translate between C and fortran communicators. If the empty
# macro definition is left alone, this translation will cause the C
# BLACS to globally block for MPI_COMM_WORLD on calls to BLACS_GRIDINIT
# and BLACS_GRIDMAP. If you choose one of the options for translating
# the context, neither the C nor fortran calls will globally block.
# If you are using MPICH, or a derivative system, you can replace the
# empty macro definition below with the following (note that if you let
# MPICH do the translation between C and fortran, you must also indicate
# here if your system has pointers that are longer than integers. If so,
# define -DPOINTER_64_BITS=1.) For help on setting TRANSCOMM, you can
# run BLACS/INSTALL/xtc_CsameF77 and BLACS/INSTALL/xtc_UseMpich as
# explained in BLACS/INSTALL/README.
# TRANSCOMM = -DUseMpich
#
# If you know that your MPI uses the same handles for fortran and C
# communicators, you can replace the empty macro definition below with
# the macro definition on the following line.
   TRANSCOMM = -DCSameF77
#
# TRANSCOMM =
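
In case it helps anyone check their own installation by hand, here is
a rough sketch of what I assume xtc_CsameF77 is probing (my own
reconstruction, not the actual installer source): whether a
communicator's C handle and its Fortran handle are bit-identical.

/* Sketch: report whether C and Fortran communicator handles coincide.
 * Under MPICH1 a C handle is an integer and the two tend to match;
 * under OpenMPI a C handle is a pointer, so they should differ and
 * TRANSCOMM must be left empty. */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Fint f = MPI_Comm_c2f(MPI_COMM_WORLD);
    /* Reinterpret the C handle as an integer for comparison. */
    intptr_t c = (intptr_t) MPI_COMM_WORLD;
    if ((intptr_t) f == c)
        printf("handles match: -DCSameF77 may be safe\n");
    else
        printf("handles differ: leave TRANSCOMM empty\n");
    MPI_Finalize();
    return 0;
}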

Michael

PS: I have successfully tested MPICH2 1.0.4p1 with BLACS 1.1p3 on the
same machine with the same compilers.

On Oct 3, 2006, at 12:14 PM, Jeff Squyres wrote:

> Thanks Michael -- I've updated ticket 356 with this info for v1.1, and
> created ticket 464 for the trunk (v1.3) issue.
>
> https://svn.open-mpi.org/trac/ompi/ticket/356
> https://svn.open-mpi.org/trac/ompi/ticket/464
>
> On 10/3/06 10:53 AM, "Michael Kluskens" <mklus_at_[hidden]> wrote:
>
>> Summary:
>>
>> OpenMPI 1.1.1 and 1.3a1r11943 have different bugs with regard to
>> BLACS 1.1p3.
>>
>> 1.3 fails where 1.1.1 passes, and vice versa.
>>
>> (1.1.1): Integer, real, and double precision SDRV tests fail cases 1 &
>> 51, then there are lots of errors until the Integer SUM test, after
>> which all tests pass.
>>
>> (1.3): No errors until it crashes on the Complex AMX test (which is
>> after the Integer Sum test).
>>
>> System configuration: Debian 3.1r3 on dual opteron, gcc 3.3.5, Intel
>> ifort 9.1.032.
>>
>> On Oct 3, 2006, at 2:44 AM, Åke Sandgren wrote:
>>
>>> On Mon, 2006-10-02 at 18:39 -0400, Michael Kluskens wrote:
>>>> OpenMPI, BLACS, and blacstester built just fine. Tester reports
>>>> errors for integer and real cases #1 and #51, and more for the other
>>>> types.
>>>>
>>>> <http://svn.open-mpi.org/trac/ompi/ticket/356> is an open ticket
>>>> related to this.
>>>
>>> Finally someone else with the same problem!!!
>>>
>>> I tried the suggested fix from ticket 356 but it didn't help.
>>> I still get lots of errors in the blacstest.
>>>
>>> I'm running on a dual-cpu opteron with Ubuntu dapper and gcc-4.0.
>>> The tests also failed on our i386 Ubuntu breezy system with gcc-3.4.
>>
>> More details of my two tests:
>> --------------------------------
>> OpenMPI 1.1.1
>> ./configure --prefix=/opt/intel9.1/openmpi/1.1.1 F77=ifort FC=ifort \
>>   --with-mpi-f90-size=medium
>>
>> BLACS 1.1 patch 3, Bmake.inc based on Bmake.MPI-LINUX with the
>> following changes:
>>
>> BTOPdir = /opt/intel9.1/openmpi/1.1.1/BLACS
>> BLACSDBGLVL = 1
>> MPIdir = /opt/intel9.1/openmpi/1.1.1
>> MPILIB =
>> INTFACE = -DAdd_
>> F77 = $(MPIdir)/bin/mpif77
>> CC = $(MPIdir)/bin/mpicc
>> CCFLAGS = -O3
>>
>> --------------------------------
>> OpenMPI 1.3a1r11943
>> ./configure --prefix=/opt/intel9.1/openmpi/1.3 F77=ifort FC=ifort \
>>   --with-mpi-f90-size=medium
>>
>> similar changes for Bmake.inc in BLACS.
>>
>> test launched in BLACS/TESTING/EXE using:
>>
>> mpirun --prefix /opt/intel9.1/openmpi/1.3 -np 4 xCbtest_MPI-LINUX-1
>>
>> This works much better: no errors at first, but it eventually fails
>> with:
>>
>> COMPLEX AMX TESTS: BEGIN.
>> Signal:11 info.si_errno:0(Success) si_code:128()
>> Failing at addr:(nil)
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0xb80000
>> [0] func:/opt/intel9.1/openmpi/1.3/lib/libopal.so.0
>> (opal_backtrace_print+0x1f) [0x2a95aa5c1f]
>> *** End of error message ***
>>
>> Michael
>>
>>
>
>
> --
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
>