Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)
From: TERRY DONTJE (terry.dontje_at_[hidden])
Date: 2011-11-22 05:49:04


The error you are seeing usually indicates that some code is operating on
memory that isn't aligned properly for the SPARC instruction being used.
The address that is causing the failure is odd, which is more than
likely the culprit. If you have a core dump and can disassemble the
code that was running at the time, it will probably turn out to be an
instruction that requires aligned operands. If the MPI you are using is
one you built yourself, could you try building OMPI with -g and getting
the line number in the PML where it is failing?
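
For illustration only (this is a minimal sketch, not code from your run),
this is the kind of access that produces SIGBUS with "Invalid address
alignment" on SPARC while passing silently on x86; note that the failing
address 0xaa9053 in your trace is odd:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    char buf[16] __attribute__((aligned(8)));
    uint32_t *p = (uint32_t *)(buf + 1);  /* deliberately odd address */
    *p = 42;                              /* SIGBUS on SPARC, tolerated on x86 */
    printf("%u\n", *p);
    return 0;
}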

I haven't seen this type of error for some time, but I do all of my SPARC
testing on Solaris with the Solaris Studio compilers. You may want to try
compiling the benchmark with "-m32" to see if that helps, though since
the failing address is odd I suspect it won't. If you can use the Studio
compilers, you could try giving them the -xmemalign=8i option when
building the benchmark and see if that resolves the issue. That would
help confirm that the issue is just the alignment of data we are slicing
and dicing, as opposed to addressing the wrong memory.
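
If it does turn out to be a misaligned field in the data path, the usual
portable fix is to memcpy the field into an aligned local rather than
casting and dereferencing the raw pointer. A rough sketch of the pattern
(not the actual OMPI code, just the general idea):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Safe at any alignment: memcpy into an aligned local variable. */
static uint64_t read_u64(const void *src)
{
    uint64_t v;
    memcpy(&v, src, sizeof v);
    return v;
}

int main(void)
{
    unsigned char stream[17] = {0};
    stream[1] = 0x2a;                       /* field starts at an odd offset */
    printf("%llu\n", (unsigned long long)read_u64(stream + 1));
    /* A direct *(uint64_t *)(stream + 1) here would raise SIGBUS on SPARC. */
    return 0;
}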

--td

On 11/21/2011 8:51 PM, Lukas Razik wrote:
> Hello everybody!
>
> I have Sun T5120 (SPARC64) servers with
> - Debian: 6.0.3
> - linux-2.6.39.4 (from kernel.org)
> - OFED-1.5.3.2
> - InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
> with the newest firmware (2.9.1)
> and the following issue:
>
> If I try to mpirun a program like the osu_latency benchmark:
> $ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 2 --mca btl_base_verbose 50 --mca btl_openib_verbose 1 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
>
> then I get these errors:
> <snip>
> # OSU MPI Latency Test v3.1.1
> # Size Latency (us)
> [cluster1:64027] *** Process received signal ***
> [cluster1:64027] Signal: Bus error (10)
> [cluster1:64027] Signal code: Invalid address alignment (1)
> [cluster1:64027] Failing at address: 0xaa9053
> [cluster1:64027] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
> [cluster1:64027] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
> [cluster1:64027] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
> [cluster1:64027] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
> [cluster1:64027] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
> [cluster1:64027] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
> [cluster1:64027] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
> [cluster1:64027] *** End of error message ***
> [cluster2:02759] *** Process received signal ***
> [cluster2:02759] Signal: Bus error (10)
> [cluster2:02759] Signal code: Invalid address alignment (1)
> [cluster2:02759] Failing at address: 0xaa9053
> [cluster2:02759] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
> [cluster2:02759] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
> [cluster2:02759] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
> [cluster2:02759] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
> [cluster2:02759] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
> [cluster2:02759] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
> [cluster2:02759] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
> [cluster2:02759] *** End of error message ***
> ---
>
> The whole output can be found here:
> http://net.razik.de/linux/T5120/openmpi-1.4.3-verbose.txt
>
> That's my 'ompi_info --param all all' output:
> http://net.razik.de/linux/T5120/openmpi-1.4.3_param_all_all.txt
>
> Same error with OFED-1.5.4-rc4 and also the same with openmpi-1.4.4.
>
> If I disable openib then I get correct results:
> $ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --mca btl ^openib -np 2 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
> # OSU MPI Latency Test v3.1.1
> # Size Latency (us)
> 0 143.53
> 1 140.50
> <snip>
> ---
>
> ibverbs seems to work:
> $ ibv_srq_pingpong -n 1000000 cluster2
> <snip>
> 8192000000 bytes in 4.15 seconds = 15806.63 Mbit/sec
> 1000000 iters in 4.15 seconds = 4.15 usec/iter
> ---
>
> These are the installed OFED packages:
> kernel-ib
> ofed-scripts
> libibverbs
> libibverbs-devel
> libibverbs-utils
> libmlx4
> libmlx4-devel
> libibumad
> libibumad-devel
> libibmad
> libibmad-devel
> librdmacm
> librdmacm-utils
> librdmacm-devel
> opensm-libs
> ibutils
> infiniband-diags
> qperf
> ofed-docs
> mpi-selector
> openmpi_gcc
> mpitests_openmpi_gcc
> ---
>
> I don't know which mailing list is the right one, so I'm very grateful for any help!
> If you have questions, please ask!
>
> Best regards,
> Lukas
>
>
> The archives of the lists I've sent this email to:
> http://lists.openfabrics.org/pipermail/ewg/2011-November/thread.html
> http://www.open-mpi.org/community/lists/devel/2011/11/date.php
> http://thread.gmane.org/gmane.linux.drivers.rdma/
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]


