Open MPI User's Mailing List Archives

From: Graham E Fagg (fagg_at_[hidden])
Date: 2006-07-19 15:03:40


Hi Frank
  I am not sure which alltoall you're using in 1.1, so can you please run
the ompi_info utility, which is normally built and put into the same
directory as mpirun?

i.e. host% ompi_info

This provides lots of really useful info on everything before we dig
deeper into your issue. Then, more specifically, please run

host% ompi_info --param coll all
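
Once we know which coll components your build actually has, one thing you
can experiment with is forcing a particular coll component from the mpirun
command line via MCA parameters. This is only a sketch, and the component
names (basic, self) are assumptions on my part -- use whatever your
ompi_info output lists:

host% mpirun --mca coll basic,self -np 12 ./your_app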

thanks
Graham

On Wed, 19 Jul 2006, Frank Gruellich wrote:

> Hi,
>
> I'm running OFED 1.0 with Open MPI 1.1b1-1, compiled with Intel Compiler
> 9.1. I get this error message during an MPI_Alltoall call:
>
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:0x1cd04fe0
> [0] func:/usr/ofed/mpi/intel/openmpi-1.1b1-1/lib64/libopal.so.0 [0x2b56964acc75]
> [1] func:/lib64/libpthread.so.0 [0x2b569739b140]
> [2] func:/software/intel/fce/9.1.032/lib/libirc.so(__intel_new_memcpy+0x1540) [0x2b5697278cf0]
> *** End of error message ***
>
> and have no idea about the problem. It arises when I exceed a specific
> number (10) of MPI nodes. The error occurs in this code:
>
> do i = 1,npuntos
>    print *,'puntos',i
>    tam = 2**(i-1)
>    tmin = 1e5
>    tavg = 0.0d0
>    do j = 1,rep
>       envio = 8.0d0*j
>       call mpi_barrier(mpi_comm_world,ierr)
>       time1 = mpi_wtime()
>       do k = 1,rep2
>          call mpi_alltoall(envio,tam,mpi_byte,recibe,tam,mpi_byte,mpi_comm_world,ierr)
>       end do
>       call mpi_barrier(mpi_comm_world,ierr)
>       time2 = mpi_wtime()
>       time = (time2 - time1)/(rep2)
>       if (time < tmin) tmin = time
>       tavg = tavg + time
>    end do
>    m_tmin(i) = tmin
>    m_tavg(i) = tavg/rep
> end do
>
> This code is said to run on another system (with IBGD 1.8.x).
> I already tested mpich_mlx_intel-0.9.7_mlx2.1.0-1, but get a similar
> error message when using 13 nodes:
>
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image PC Routine Line Source
> libpthread.so.0 00002B65DA39B140 Unknown Unknown Unknown
> main.out 0000000000448BDB Unknown Unknown Unknown
> [9] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [6] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 9 - MPI_ALLTOALL : Unknown error
> [9] [] Aborting Program!
> 6 - MPI_ALLTOALL : Unknown error
> [6] [] Aborting Program!
> [2] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [11] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 11 - MPI_ALLTOALL : Unknown error
> [11] [] Aborting Program!
> 2 - MPI_ALLTOALL : Unknown error
> [2] [] Aborting Program!
> [10] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 10 - MPI_ALLTOALL : Unknown error
> [10] [] Aborting Program!
> [5] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 5 - MPI_ALLTOALL : Unknown error
> [5] [] Aborting Program!
> [3] Registration failed, file : intra_rdma_alltoall.c, line : 163
> [8] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 3 - MPI_ALLTOALL : Unknown error
> [3] [] Aborting Program!
> 8 - MPI_ALLTOALL : Unknown error
> [8] [] Aborting Program!
> [4] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 4 - MPI_ALLTOALL : Unknown error
> [4] [] Aborting Program!
> [7] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 7 - MPI_ALLTOALL : Unknown error
> [7] [] Aborting Program!
> [0] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 0 - MPI_ALLTOALL : Unknown error
> [0] [] Aborting Program!
> [1] Registration failed, file : intra_rdma_alltoall.c, line : 163
> 1 - MPI_ALLTOALL : Unknown error
> [1] [] Aborting Program!
>
> I don't know whether this is a problem with MPI or the Intel Compiler.
> Please, can anybody point me in the right direction as to what I could
> have done wrong? This is my first post (so be gentle) and I'm not yet
> familiar with the expected level of detail on this list, so if you need
> any further information, do not hesitate to request it.
>
> Thanks in advance and kind regards,
> --
> Frank Gruellich
> HPC Technician
>
> Tel.: +49 3722 528 42
> Fax: +49 3722 528 15
> E-Mail: frank.gruellich_at_[hidden]
>
> MEGWARE Computer GmbH
> Sales and Service
> Nordstrasse 19
> 09247 Chemnitz/Roehrsdorf
> Germany
> http://www.megware.com/
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
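
One other thing to check while you collect that output: in the loop quoted
above, mpi_alltoall with mpi_byte sends tam bytes from every rank to every
other rank, so both envio and recibe have to provide at least tam * nprocs
bytes each. A buffer that happens to be large enough on 10 nodes can be too
small on 13, which would fit a segfault that only appears past a certain
node count. Your declarations aren't shown, so the sketch below is only
illustrative (the names and allocation scheme are my assumptions, not your
actual code):

! illustrative sketch only: sizing alltoall buffers for nprocs ranks
program alltoall_buffers
  implicit none
  include 'mpif.h'
  integer :: ierr, nprocs, tam
  real(8), allocatable :: envio(:), recibe(:)

  call mpi_init(ierr)
  call mpi_comm_size(mpi_comm_world, nprocs, ierr)

  ! tam bytes go to each of the nprocs ranks, so each buffer needs
  ! tam*nprocs bytes, i.e. tam*nprocs/8 double precision elements
  tam = 512                      ! e.g. the largest message size tested
  allocate(envio(tam*nprocs/8), recibe(tam*nprocs/8))
  envio = 8.0d0

  call mpi_alltoall(envio, tam, mpi_byte, recibe, tam, mpi_byte, &
                    mpi_comm_world, ierr)

  deallocate(envio, recibe)
  call mpi_finalize(ierr)
end program alltoall_buffers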

Thanks,
         Graham.
----------------------------------------------------------------------
Dr Graham E. Fagg | Distributed, Parallel and Meta-Computing
Innovative Computing Lab. PVM3.4, HARNESS, FT-MPI, SNIPE & Open MPI
Computer Science Dept | Suite 203, 1122 Volunteer Blvd,
University of Tennessee | Knoxville, Tennessee, USA. TN 37996-3450
Email: fagg_at_[hidden] | Phone:+1(865)974-5790 | Fax:+1(865)974-8296
Broken complex systems are always derived from working simple systems
----------------------------------------------------------------------