Open MPI User's Mailing List Archives

From: Graham E Fagg (fagg_at_[hidden])
Date: 2006-07-19 15:03:40


Hi Frank
  I am not sure which alltoall implementation you're using in 1.1, so
could you please run the ompi_info utility, which is normally built and
installed into the same directory as mpirun?

i.e. host% ompi_info

This provides lots of really useful info on everything before we dig
deeper into your issue,

and then more specifically run
host% ompi_info --param coll all
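
If the tuned collectives turn out to be active, you can also narrow the
output and experiment with forcing a particular alltoall algorithm via
MCA parameters. (The parameter names below come from the tuned
component's knobs; treat them as an assumption for your 1.1 build and
check them against your ompi_info output first.)

host% ompi_info --param coll tuned
host% mpirun --mca coll_tuned_use_dynamic_rules 1 \
             --mca coll_tuned_alltoall_algorithm 1 ...

Here algorithm 1 selects the basic linear variant; if a different
variant behaves differently, that helps separate an algorithm-specific
bug from a transport problem.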

thanks
Graham

On Wed, 19 Jul 2006, Frank Gruellich wrote:

> Hi,
>
> I'm running OFED 1.0 with OpenMPI 1.1b1-1 compiled with the Intel
> Compiler 9.1. I get this error message during an MPI_Alltoall call:
>
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x1cd04fe0
> [0] func:/usr/ofed/mpi/intel/openmpi-1.1b1-1/lib64/libopal.so.0 [0x2b56964acc75]
> [1] func:/lib64/libpthread.so.0 [0x2b569739b140]
> [2] func:/software/intel/fce/9.1.032/lib/libirc.so(__intel_new_memcpy+0x1540) [0x2b5697278cf0]
> *** End of error message ***
>
> and have no idea what the problem is. It arises as soon as I exceed a
> specific number (10) of MPI nodes. The error occurs in this code:
>
> do i = 1,npuntos
>    print *,'puntos',i
>    tam = 2**(i-1)
>    tmin = 1e5
>    tavg = 0.0d0
>    do j = 1,rep
>       envio = 8.0d0*j
>       call mpi_barrier(mpi_comm_world,ierr)
>       time1 = mpi_wtime()
>       do k = 1,rep2
>          call mpi_alltoall(envio,tam,mpi_byte,recibe,tam,mpi_byte,mpi_comm_world,ierr)
>       end do
>       call mpi_barrier(mpi_comm_world,ierr)
>       time2 = mpi_wtime()
>       time = (time2 - time1)/(rep2)
>       if (time < tmin) tmin = time
>       tavg = tavg + time
>    end do
>    m_tmin(i) = tmin
>    m_tavg(i) = tavg/rep
> end do
>
> this code is said to run fine on another system (running IBGD 1.8.x).
> I have already tested mpich_mlx_intel-0.9.7_mlx2.1.0-1, but I get a
> similar error message when using 13 nodes:
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image            PC                Routine   Line      Source
> libpthread.so.0  00002B65DA39B140  Unknown   Unknown   Unknown
> main.out         0000000000448BDB  Unknown   Unknown   Unknown
> [9] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [6] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 9 - MPI_ALLTOALL : Unknown error
> [9] [] Aborting Program!
> 6 - MPI_ALLTOALL : Unknown error
> [6] [] Aborting Program!
> [2] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [11] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 11 - MPI_ALLTOALL : Unknown error
> [11] [] Aborting Program!
> 2 - MPI_ALLTOALL : Unknown error
> [2] [] Aborting Program!
> [10] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 10 - MPI_ALLTOALL : Unknown error
> [10] [] Aborting Program!
> [5] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 5 - MPI_ALLTOALL : Unknown error
> [5] [] Aborting Program!
> [3] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [8] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 3 - MPI_ALLTOALL : Unknown error
> [3] [] Aborting Program!
> 8 - MPI_ALLTOALL : Unknown error
> [8] [] Aborting Program!
> [4] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 4 - MPI_ALLTOALL : Unknown error
> [4] [] Aborting Program!
> [7] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 7 - MPI_ALLTOALL : Unknown error
> [7] [] Aborting Program!
> [0] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 0 - MPI_ALLTOALL : Unknown error
> [0] [] Aborting Program!
> [1] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 1 - MPI_ALLTOALL : Unknown error
> [1] [] Aborting Program!
>
> I don't know whether this is a problem with MPI or the Intel Compiler.
> Please, can anybody point me in the right direction as to what I could
> have done wrong? This is my first post (so be gentle), and I'm not yet
> used to the customs of this list, so if you need any further
> information, do not hesitate to request it.
>
> Thanks in advance and kind regards,
> --
> Frank Gruellich
> HPC Technician
>
> Tel.: +49 3722 528 42
> Fax: +49 3722 528 15
> E-Mail: frank.gruellich_at_[hidden]
>
> MEGWARE Computer GmbH
> Sales and Service
> Nordstrasse 19
> 09247 Chemnitz/Roehrsdorf
> Germany
> http://www.megware.com/
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
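
One thing that jumps out of the code above: mpi_alltoall with
sendcount = tam and type mpi_byte sends tam bytes to *each* rank, so
both envio and recibe have to be at least tam * nprocs bytes long, yet
envio looks like it is assigned as a scalar (envio = 8.0d0*j). If the
buffers really are that small, the library reads and writes past the
end of them once tam * nprocs grows large enough, which would explain
both the SEGV inside __intel_new_memcpy and the "Registration failed"
messages from the RDMA alltoall, and also why the crash only appears
beyond a certain node count. Here is a minimal sketch of how the
buffers would need to be sized (the declarations and the npuntos value
are my assumptions; only the variable names are taken from your
snippet):

program alltoall_buffers
  use mpi
  implicit none
  integer, parameter :: npuntos = 20   ! assumed; largest i in your loop
  integer :: nprocs, ierr, maxtam
  character, allocatable :: envio(:), recibe(:)

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nprocs, ierr)
  maxtam = 2**(npuntos-1)              ! largest tam the outer loop uses
  allocate(envio(maxtam*nprocs))       ! send buffer: tam bytes per rank
  allocate(recibe(maxtam*nprocs))      ! recv buffer: tam bytes per rank
  ! ... benchmark loops from the quoted snippet go here ...
  deallocate(envio, recibe)
  call mpi_finalize(ierr)
end program alltoall_buffers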

Thanks,
         Graham.
----------------------------------------------------------------------
Dr Graham E. Fagg | Distributed, Parallel and Meta-Computing
Innovative Computing Lab. PVM3.4, HARNESS, FT-MPI, SNIPE & Open MPI
Computer Science Dept | Suite 203, 1122 Volunteer Blvd,
University of Tennessee | Knoxville, Tennessee, USA. TN 37996-3450
Email: fagg_at_[hidden] | Phone:+1(865)974-5790 | Fax:+1(865)974-8296
Broken complex systems are always derived from working simple systems
----------------------------------------------------------------------