Open MPI Development Mailing List Archives

From: Daniel Spångberg (Daniel.Spangberg_at_[hidden])
Date: 2007-06-13 04:49:51


Dear Rainer,

Thanks for your (and George's) very quick answer and the fast solution!
The patch works great.

Best regards
Daniel Spångberg

On Wed, 13 Jun 2007 01:38:26 +0200, Rainer Keller <keller_at_[hidden]> wrote:

> Dear Daniel,
> well, this definitely would be an issue for the users_at_[hidden] list.
> Your report is definitely important and valid: all functions allowing
> MPI_BOTTOM in the Fortran interface should use OMPI_ADDR to convert for
> the C interface...
>
> Aaah, I see George has a similar commit -- possibly we need to sync.
> Please see the code attached; compared to r15030, it adapts the macros
> to use OMPI_F2C_...
>
> Thanks,
> Rainer
>
> On Tuesday 12 June 2007 23:15, Daniel Spångberg wrote:
>> One of the users of our computer cluster has reported a problem with his
>> simulation code. We have been able to trace this to a problem with the
>> MPI struct type when absolute addressing is used for the members of the
>> struct, necessitating the use of MPI_BOTTOM. The problem occurs with
>> MPI_BCAST, but works fine with MPI_SEND and MPI_RECV. I dug into the
>> Open MPI code and found that the problem occurs only in Fortran. This
>> seems to be because, when MPI_BCAST is called, the OMPI_ADDR macro in
>> ompi/mpi/f77/constants.h is never evaluated, so the Fortran MPI_BOTTOM
>> is never turned into the C MPI_BOTTOM. When MPI_SEND and MPI_RECV are
>> used, the OMPI_ADDR macro is evaluated and no problem occurs.
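>>
>> For illustration, the kind of translation I would expect the Fortran
>> wrapper to apply to the buffer argument looks roughly like this (a
>> sketch only -- the sentinel symbol name and helper here are my own
>> placeholders, not the actual Open MPI source):
>>
>> #include <mpi.h>
>>
>> /* Placeholder for the Fortran-side MPI_BOTTOM sentinel object. */
>> extern char example_fortran_bottom;
>>
>> /* Translate a Fortran buffer address for the C interface: the Fortran
>>    MPI_BOTTOM sentinel must become the C MPI_BOTTOM, every other
>>    address passes through unchanged. */
>> void *example_f2c_addr(void *addr)
>> {
>>     return (addr == (void *) &example_fortran_bottom) ? MPI_BOTTOM
>>                                                       : addr;
>> }
>>
>> /* A bcast wrapper would then have to call, e.g.:
>>        MPI_Bcast(example_f2c_addr(buffer), count, type, root, comm);
>>    whereas the failing path apparently passes the raw Fortran address. */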
>>
>> An example of a problematic code (tested on two processes):
>>
>>       program testme
>>       implicit none
>>       include 'mpif.h'
>>       integer ierr,rank,size
>>       integer btype(1),blen(1),param_batch
>>       real param
>>       integer(kind=mpi_address_kind) :: disp(1)
>>       integer status(MPI_STATUS_SIZE)
>>
>>       param=0.
>>
>>       call MPI_INIT(ierr)
>>       call MPI_GET_ADDRESS(param,disp(1),ierr)
>>       if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_GET_ADDRESS FAILED'
>>       btype(1)=MPI_REAL
>>       blen(1)=1
>>       call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
>>       call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)
>>       call MPI_TYPE_CREATE_STRUCT(1,blen,disp,btype,param_batch,ierr)
>>       if (ierr.ne.MPI_SUCCESS) then
>>          write(*,*) 'MPI_TYPE_CREATE_STRUCT FAILED'
>>       endif
>>       call MPI_TYPE_COMMIT(param_batch,ierr)
>>       if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_TYPE_COMMIT FAILED'
>>       if (rank.eq.0) param=100.
>> c     if (rank.eq.0) then
>> c        call MPI_SEND(MPI_BOTTOM,1,
>> c    x        param_batch,1,0,MPI_COMM_WORLD,ierr)
>> c     else
>> c        call MPI_RECV(MPI_BOTTOM,1,
>> c    x        param_batch,0,0,MPI_COMM_WORLD,status,ierr)
>> c     endif
>>       call MPI_BCAST(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD,ierr)
>>       if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_BCAST FAILED'
>>       write(*,*) 'Rank:',rank,'Size:',size,'Param=',param
>>       call MPI_Finalize(ierr)
>>
>>       end program testme
>>
>> mpirun -np 2 testme_submitted_to_devel
>> [auchentoshan:21021] *** Process received signal ***
>> [auchentoshan:21021] Signal: Segmentation fault (11)
>> [auchentoshan:21021] Signal code: Address not mapped (1)
>> [auchentoshan:21021] Failing at address: 0x7fc04ffd7c
>> [auchentoshan:21021] [ 0] /lib64/tls/libpthread.so.0 [0x3d9660c430]
>> [auchentoshan:21021] [ 1] /lib64/tls/libc.so.6(memcpy+0x60)
>> [0x3d95571ec0]
>> [auchentoshan:21021] [ 2]
>> /opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(ompi_convertor_pack+0x164)
>> [0x2a957cf5b4]
>> [auchentoshan:21021] [ 3]
>> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request
>> _start_copy+0x24c) [0x2a982851dc]
>> [auchentoshan:21021] [ 4]
>> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x217)
>> [0x2a9827fe77]
>> [auchentoshan:21021] [ 5]
>> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast
>> _intra_generic+0x354) [0x2a98ac08b4]
>> [auchentoshan:21021] [ 6]
>> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast
>> _intra_binomial+0xc8) [0x2a98ac0bd8]
>> [auchentoshan:21021] [ 7]
>> /opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(PMPI_Bcast+0x15c) [0x2a957d62ac]
>> [auchentoshan:21021] [ 8]
>> /opt/openmpi-1.2.1-gcc4/lib/libmpi_f77.so.0(pmpi_bcast_+0x5a)
>> [0x2a9567e99a]
>> [auchentoshan:21021] [ 9] testme_submitted_to_devel(MAIN__+0x1f8)
>> [0x401080]
>> [auchentoshan:21021] [10] testme_submitted_to_devel(main+0xe) [0x4011be]
>> [auchentoshan:21021] [11] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
>> [0x3d9551c3fb]
>> [auchentoshan:21021] [12] testme_submitted_to_devel [0x400dfa]
>> [auchentoshan:21021] *** End of error message ***
>> mpirun noticed that job rank 0 with PID 21021 on node auchentoshan
>> exited on signal 11 (Segmentation fault).
>> 1 additional process aborted (not shown)
>>
>> We have tested this with Open MPI versions up to and including the
>> latest, 1.2.2, although the log above is from version 1.2.1.
>> Distribution: Scientific Linux 4.4 (a RHEL4 clone). The problem occurs
>> on both AMD64 and i386, and with both gcc/gfortran and the Portland
>> Group compilers version 7.0-4. Interconnect: gigabit Ethernet.
>>
>> daniels_at_auchentoshan:~ > gcc4 --version
>> gcc4 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
>> Copyright (C) 2006 Free Software Foundation, Inc.
>> This is free software; see the source for copying conditions. There is NO
>> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>>
>> daniels_at_auchentoshan:~ > gfortran --version
>> GNU Fortran 95 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
>> Copyright (C) 2006 Free Software Foundation, Inc.
>>
>> GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
>> You may redistribute copies of GNU Fortran
>> under the terms of the GNU General Public License.
>> For more information about these matters, see the file named COPYING
>>
>> Working C code which does the same thing:
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <mpi.h>
>>
>> int main(int argc, char **argv)
>> {
>>     int rank,size;
>>     float param=0.;
>>     MPI_Datatype btype[1],param_batch;
>>     MPI_Aint disp[1];
>>     int blen[1];
>>
>>     MPI_Init(&argc,&argv);
>>     if (MPI_Get_address(&param,&disp[0])!=MPI_SUCCESS)
>>         fprintf(stderr,"MPI_Get_address failed.\n");
>>     btype[0]=MPI_FLOAT;
>>     blen[0]=1;
>>     MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>     MPI_Comm_size(MPI_COMM_WORLD,&size);
>>     if (MPI_Type_create_struct(1,blen,disp,btype,&param_batch)!=MPI_SUCCESS)
>>         fprintf(stderr,"MPI_Type_Create_Struct failed.\n");
>>     if (MPI_Type_commit(&param_batch)!=MPI_SUCCESS)
>>         fprintf(stderr,"MPI_Type_Commit failed.\n");
>>     if (rank==0)
>>         param=100.;
>>     if (MPI_Bcast(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD)!=MPI_SUCCESS)
>>         fprintf(stderr,"MPI_Bcast failed.\n");
>>     printf("Rank:%d, Size:%d, Param=%f\n",rank,size,param);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>>
>> Best regards
>> Daniel Spångberg
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>