
Open MPI Development Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2007-06-12 19:00:31


Daniel,

Thanks for the very precise and clear report. I checked the code and
indeed we are missing the OMPI_ADDR macro call in a lot of places in the
Fortran interface (though not in the point-to-point communications). I
will correct the issue shortly, and hopefully the fix will make its way
into the next release (1.2.3).

  Thanks,
    george.
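
For reference, here is a minimal sketch (in C, with invented names; it is
not Open MPI's actual OMPI_ADDR implementation) of the kind of sentinel
translation the Fortran binding for MPI_BCAST has to apply to the buffer
argument before calling into the C library:

  /*
   * Hedged sketch only: the real conversion lives in the OMPI_ADDR macro
   * in ompi/mpi/f77/constants.h.  The sentinel variable and the wrapper
   * name below are hypothetical, invented purely for illustration.
   */
  #include <mpi.h>

  /* Stand-in for the address the Fortran side passes when the caller
   * uses MPI_BOTTOM (hypothetical; Open MPI uses its own symbol). */
  static char fortran_bottom_sentinel;
  #define FORTRAN_MPI_BOTTOM ((void *) &fortran_bottom_sentinel)

  /* Substitute the C MPI_BOTTOM when the Fortran sentinel is seen;
   * otherwise pass the buffer through unchanged. */
  static void *translate_fortran_addr(void *buf)
  {
      return (buf == FORTRAN_MPI_BOTTOM) ? MPI_BOTTOM : buf;
  }

  /* Hypothetical Fortran-to-C shim for MPI_BCAST.  The reported bug is
   * that the collective wrappers hand the raw Fortran address to the C
   * layer instead of applying a translation like this one, while the
   * point-to-point wrappers do apply it. */
  void example_bcast_wrapper(void *buf, int count, MPI_Datatype type,
                             int root, MPI_Comm comm, int *ierr)
  {
      *ierr = PMPI_Bcast(translate_fortran_addr(buf), count, type,
                         root, comm);
  }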

Daniel Spångberg wrote:
> Dear openmpi developers,
>
> I have very recently joined this list to be able to submit this report, as
> recommended in the FAQ, which is why I have not yet had time to familiarize
> myself with the etiquette of this list; I apologize in advance for any
> missteps. If you require any additional information, I am more than happy
> to provide it.
>
> One of the users of our computer cluster has reported a problem with his
> simulation code. We have been able to trace it down to a problem with
> MPI struct datatypes when absolute addresses are used for the struct
> members, necessitating the use of MPI_BOTTOM. The problem occurs with
> MPI_BCAST, but everything works fine with MPI_SEND and MPI_RECV. I dug
> into the Open MPI code and found that the problem occurs only in Fortran,
> apparently because when MPI_BCAST is called, the OMPI_ADDR macro in
> ompi/mpi/f77/constants.h is never evaluated, so the Fortran MPI_BOTTOM is
> never translated into the C MPI_BOTTOM. When MPI_SEND and MPI_RECV are
> used, the OMPI_ADDR macro is evaluated and no problem occurs.
>
> An example of a problematic code (tested on two processes):
>
> program testme
> implicit none
> include 'mpif.h'
> integer ierr,rank,size
> integer btype(1),blen(1),param_batch
> real param
> integer(kind=mpi_address_kind) :: disp(1)
> integer status(MPI_STATUS_SIZE)
>
> param=0.
>
> call MPI_INIT(ierr)
> call MPI_GET_ADDRESS(param,disp(1),ierr)
> if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_GET_ADDRESS FAILED'
> btype(1)=MPI_REAL
> blen(1)=1
> call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
> call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)
> call MPI_TYPE_CREATE_STRUCT(1,blen,disp,btype,param_batch,ierr)
> if (ierr.ne.MPI_SUCCESS) then
> write(*,*) 'MPI_TYPE_CREATE_STRUCT FAILED'
> endif
> call MPI_TYPE_COMMIT(param_batch,ierr)
> if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_TYPE_COMMIT FAILED'
> if (rank.eq.0) param=100.
> c if (rank.eq.0) then
> c call MPI_SEND(MPI_BOTTOM,1,
> c x param_batch,1,0,MPI_COMM_WORLD,ierr)
> c else
> c call MPI_RECV(MPI_BOTTOM,1,
> c x param_batch,0,0,MPI_COMM_WORLD,status,ierr)
> c endif
> call MPI_BCAST(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD,ierr)
> if (ierr.ne.MPI_SUCCESS) write(*,*) 'MPI_BCAST FAILED'
> write(*,*) 'Rank:',rank,'Size:',size,'Param=',param
> call MPI_Finalize(ierr)
>
> end program testme
>
> mpirun -np 2 testme_submitted_to_devel
> [auchentoshan:21021] *** Process received signal ***
> [auchentoshan:21021] Signal: Segmentation fault (11)
> [auchentoshan:21021] Signal code: Address not mapped (1)
> [auchentoshan:21021] Failing at address: 0x7fc04ffd7c
> [auchentoshan:21021] [ 0] /lib64/tls/libpthread.so.0 [0x3d9660c430]
> [auchentoshan:21021] [ 1] /lib64/tls/libc.so.6(memcpy+0x60) [0x3d95571ec0]
> [auchentoshan:21021] [ 2]
> /opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(ompi_convertor_pack+0x164)
> [0x2a957cf5b4]
> [auchentoshan:21021] [ 3]
> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x24c)
> [0x2a982851dc]
> [auchentoshan:21021] [ 4]
> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x217)
> [0x2a9827fe77]
> [auchentoshan:21021] [ 5]
> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_generic+0x354)
> [0x2a98ac08b4]
> [auchentoshan:21021] [ 6]
> /opt/openmpi-1.2.1-gcc4/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_binomial+0xc8)
> [0x2a98ac0bd8]
> [auchentoshan:21021] [ 7]
> /opt/openmpi-1.2.1-gcc4/lib/libmpi.so.0(PMPI_Bcast+0x15c) [0x2a957d62ac]
> [auchentoshan:21021] [ 8]
> /opt/openmpi-1.2.1-gcc4/lib/libmpi_f77.so.0(pmpi_bcast_+0x5a)
> [0x2a9567e99a]
> [auchentoshan:21021] [ 9] testme_submitted_to_devel(MAIN__+0x1f8)
> [0x401080]
> [auchentoshan:21021] [10] testme_submitted_to_devel(main+0xe) [0x4011be]
> [auchentoshan:21021] [11] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
> [0x3d9551c3fb]
> [auchentoshan:21021] [12] testme_submitted_to_devel [0x400dfa]
> [auchentoshan:21021] *** End of error message ***
> mpirun noticed that job rank 0 with PID 21021 on node auchentoshan exited
> on signal 11 (Segmentation fault).
> 1 additional process aborted (not shown)
>
> We have tested this on several Open MPI versions, including the latest
> (1.2.2), although the log message above is from version 1.2.1.
> Distribution: Scientific Linux 4.4 (a RHEL4 clone). The problem occurs on
> both AMD64 and i386, and with both gcc/gfortran and the Portland Group
> compilers version 7.0-4. Interconnect: GbE.
>
> daniels_at_auchentoshan:~ > gcc4 --version
> gcc4 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
> Copyright (C) 2006 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
> daniels_at_auchentoshan:~ > gfortran --version
> GNU Fortran 95 (GCC) 4.1.0 20060515 (Red Hat 4.1.0-18)
> Copyright (C) 2006 Free Software Foundation, Inc.
>
> GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
> You may redistribute copies of GNU Fortran
> under the terms of the GNU General Public License.
> For more information about these matters, see the file named COPYING
>
> Working C code which does the same thing:
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
> int rank,size;
> float param=0.;
> MPI_Datatype btype[1],param_batch;
> MPI_Aint disp[1];
> int blen[1];
>
> MPI_Init(&argc,&argv);
> if (MPI_Get_address(&param,&disp[0])!=MPI_SUCCESS)
> fprintf(stderr,"MPI_Get_address failed.\n");
> btype[0]=MPI_FLOAT;
> blen[0]=1;
> MPI_Comm_rank(MPI_COMM_WORLD,&rank);
> MPI_Comm_size(MPI_COMM_WORLD,&size);
> if (MPI_Type_create_struct(1,blen,disp,btype,&param_batch)!=MPI_SUCCESS)
> fprintf(stderr,"MPI_Type_Create_Struct failed.\n");
> if (MPI_Type_commit(&param_batch)!=MPI_SUCCESS)
> fprintf(stderr,"MPI_Type_Commit failed.\n");
> if (rank==0)
> param=100.;
> if (MPI_Bcast(MPI_BOTTOM,1,param_batch,0,MPI_COMM_WORLD)!=MPI_SUCCESS)
> fprintf(stderr,"MPI_Bcast failed.\n");
> printf("Rank:%d, Size:%d, Param=%f\n",rank,size,param);
> MPI_Finalize();
> return 0;
> }
>
>
> Best regards
> Daniel Spångberg
>