Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-07-19 20:58:43


Frank,

For the all-to-all collective, the send and receive buffers have to be
large enough to hold all the data being exchanged. In this particular
case, as you initialize the envio variable with a double, I suppose it
is declared as a double. If that is the case, the error is that the
send operation will read more data than the envio variable actually
contains.

To do the all-to-all correctly in your example, make sure the envio
variable has a size of at least tam * sizeof(byte) * NPROCS, where
NPROCS is the number of processes in the mpi_comm_world communicator.
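
For illustration, here is a minimal sketch of a correctly sized
exchange (the names tam, envio and recibe are borrowed from your code;
everything else, including the sample tam value, is hypothetical):

   program alltoall_sketch
   implicit none
   include 'mpif.h'
   integer :: ierr, nprocs, tam
   ! Each rank sends tam bytes to every other rank, so both buffers
   ! must hold at least tam * nprocs bytes.
   character, allocatable :: envio(:), recibe(:)

   call mpi_init(ierr)
   call mpi_comm_size(mpi_comm_world, nprocs, ierr)

   tam = 1024                      ! bytes exchanged with each rank
   allocate(envio(tam*nprocs))     ! send buffer: tam bytes per peer
   allocate(recibe(tam*nprocs))    ! receive buffer: same total size
   envio = 'x'

   call mpi_alltoall(envio, tam, mpi_byte, recibe, tam, mpi_byte, &
                     mpi_comm_world, ierr)

   deallocate(envio, recibe)
   call mpi_finalize(ierr)
   end program alltoall_sketch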

Moreover, the error messages seem to indicate that a memory
registration failed; this could well be the send buffer.

   Thanks,
     George.

On Wed, 19 Jul 2006, Frank Gruellich wrote:

> Hi,
>
> I'm running OFED 1.0 with OpenMPI 1.1b1-1 compiled for Intel Compiler
> 9.1. I get this error message during an MPI_Alltoall call:
>
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x1cd04fe0
> [0] func:/usr/ofed/mpi/intel/openmpi-1.1b1-1/lib64/libopal.so.0 [0x2b56964acc75]
> [1] func:/lib64/libpthread.so.0 [0x2b569739b140]
> [2] func:/software/intel/fce/9.1.032/lib/libirc.so(__intel_new_memcpy+0x1540) [0x2b5697278cf0]
> *** End of error message ***
>
> and have no idea about the problem. It arises once I exceed a specific
> number (10) of MPI nodes. The error occurs in this code:
>
> do i = 1,npuntos
>   print *,'puntos',i
>   tam = 2**(i-1)
>   tmin = 1e5
>   tavg = 0.0d0
>   do j = 1,rep
>     envio = 8.0d0*j
>     call mpi_barrier(mpi_comm_world,ierr)
>     time1 = mpi_wtime()
>     do k = 1,rep2
>       call mpi_alltoall(envio,tam,mpi_byte,recibe,tam,mpi_byte,mpi_comm_world,ierr)
>     end do
>     call mpi_barrier(mpi_comm_world,ierr)
>     time2 = mpi_wtime()
>     time = (time2 - time1)/(rep2)
>     if (time < tmin) tmin = time
>     tavg = tavg + time
>   end do
>   m_tmin(i) = tmin
>   m_tavg(i) = tavg/rep
> end do
>
> this code is said to run on another system (running IBGD 1.8.x).
> I already tested mpich_mlx_intel-0.9.7_mlx2.1.0-1, but got a similar
> error message when using 13 nodes:
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> libpthread.so.0 00002B65DA39B140 Unknown Unknown Unknown
> main.out 0000000000448BDB Unknown Unknown Unknown
> [9] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [6] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 9 - MPI_ALLTOALL : Unknown error
> [9] [] Aborting Program!
> 6 - MPI_ALLTOALL : Unknown error
> [6] [] Aborting Program!
> [2] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [11] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 11 - MPI_ALLTOALL : Unknown error
> [11] [] Aborting Program!
> 2 - MPI_ALLTOALL : Unknown error
> [2] [] Aborting Program!
> [10] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 10 - MPI_ALLTOALL : Unknown error
> [10] [] Aborting Program!
> [5] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 5 - MPI_ALLTOALL : Unknown error
> [5] [] Aborting Program!
> [3] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [8] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 3 - MPI_ALLTOALL : Unknown error
> [3] [] Aborting Program!
> 8 - MPI_ALLTOALL : Unknown error
> [8] [] Aborting Program!
> [4] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 4 - MPI_ALLTOALL : Unknown error
> [4] [] Aborting Program!
> [7] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 7 - MPI_ALLTOALL : Unknown error
> [7] [] Aborting Program!
> [0] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 0 - MPI_ALLTOALL : Unknown error
> [0] [] Aborting Program!
> [1] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 1 - MPI_ALLTOALL : Unknown error
> [1] [] Aborting Program!
>
> I don't know whether this is a problem with MPI or the Intel Compiler.
> Please, can anybody point me in the right direction as to what I could
> have done wrong? This is my first post (so be gentle), and I'm not yet
> used to the level of detail expected on this list, so if you need any
> further information, do not hesitate to request it.
>
> Thanks in advance and kind regards,
>

"We must accept finite disappointment, but we must never lose infinite
hope."
                                   Martin Luther King