Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [BUG?] OpenMPI with openib on SPARC64: Signal: Bus error (10)
From: Lukas Razik (linux_at_[hidden])
Date: 2011-11-22 18:05:26


TERRY DONTJE <terry.dontje_at_[hidden]> wrote:
>On 11/22/2011 5:49 AM, TERRY DONTJE wrote:
>The error you are seeing usually indicates code operating on memory that isn't aligned properly for the SPARC instruction being used.  The address that is causing the failure is odd-aligned, which is more than likely the culprit.  If you have a core dump and can disassemble the code that was being run at the time, it will probably be some sort of instruction requiring an alignment.  If the MPI you are using is something you built, can you try building OMPI with -g and get the line number in the PML that is failing?
>>
>>I haven't seen this type of error for some time but I do all of my SPARC testing on Solaris with Solaris Studio Compilers.  You may want to try to compile the benchmark with "-m32" to see if that helps.  Though being an odd address I suspect it might not.  If you can use the Studio Compilers you could try giving the compilers the -xmemalign=8i option when building the benchmark and see if that resolves the issue.  This would help to assure the issue is just an alignment of data we are slicing and dicing as opposed to wrongly addressing memory.
>>
>>
>>After thinking about this, you probably won't be able to use the Studio Compilers, because on Linux they only support x86 platforms, not SPARC.  I'm not sure whether gcc has anything like the -xmemalign option.

Hello Terry,

We no longer have Solaris on the machines. The whole effort is to get Linux running on them...
With a lot of help from Roland Dreier and patches from David Miller, the InfiniBand drivers now seem to work on our SPARC64 machines under Debian. The only major piece of OFED that is still missing is OpenMPI.

--
BTW:
With Debian's gcc (Debian 4.4.5-8) I've built this new environment:
- binutils-2.21.1 (from gnu.org)
- gcc-4.4.6 (from gnu.org)
- libtool-2.2.6b (from gnu.org)
I used this new environment to build:
- linux-2.6.39.4 (from kernel.org)
- OFED-1.5.3.2 with openmpi-1.4.3, the ofa kernel modules etc. (from openfabrics.org)
- openmpi-1.4.4 (from open-mpi.org)
---
You asked for debugging information. Here is a screenshot of kdbg showing the stack, the line number, etc.:
http://net.razik.de/linux/T5120/kdbg-openmpi-1.4.4-osu_latency.png
This is the backtrace gdb produced from the core file:
---
# gdb /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency core
<snip>
Reading symbols from /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency...(no debugging symbols found)...done.
[New LWP 54054]
[New LWP 54055]
[New LWP 54056]
[Thread debugging using libthread_db enabled]
Core was generated by `/usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency'.
Program terminated with signal 10, Bus error.
#0  0xfffff8010229ba9c in mca_pml_ob1_send_request_start_copy (sendreq=0xb23200, bml_btl=0xb29050, size=0) at pml_ob1_sendreq.c:551
551         hdr->hdr_match.hdr_ctx = sendreq->req_send.req_base.req_comm->c_contextid;
(gdb) backtrace
#0  0xfffff8010229ba9c in mca_pml_ob1_send_request_start_copy (sendreq=0xb23200, bml_btl=0xb29050, size=0) at pml_ob1_sendreq.c:551
#1  0xfffff80102286d28 in mca_pml_ob1_send_request_start_btl (sendreq=0xb23200, bml_btl=0xb29050) at pml_ob1_sendreq.h:363
#2  0xfffff80102287050 in mca_pml_ob1_send_request_start (sendreq=0xb23200) at pml_ob1_sendreq.h:429
#3  0xfffff801022879ec in mca_pml_ob1_isend (buf=0x0, count=0, datatype=0xfffff80100290dc0, dst=1, tag=-16,
    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x201b50, request=0x7feffa130c8) at pml_ob1_isend.c:87
#4  0xfffff8010343d338 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0, scount=0, sdatatype=0xfffff80100290dc0, dest=1, stag=-16,
    recvbuf=0x0, rcount=0, rdatatype=0xfffff80100290dc0, source=1, rtag=-16, comm=0x201b50, status=0x0) at coll_tuned_util.c:51
#5  0xfffff8010344fd94 in ompi_coll_tuned_barrier_intra_two_procs (comm=0x201b50, module=0xb2b070) at coll_tuned_barrier.c:258
#6  0xfffff8010343de94 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x201b50, module=0xb2b070) at coll_tuned_decision_fixed.c:192
#7  0xfffff801000bfff0 in PMPI_Barrier (comm=0x201b50) at pbarrier.c:59
#8  0x0000000000100f3c in main ()
---
This is the corresponding mpirun run:
---
# /usr/mpi/gcc/openmpi-1.4.4/bin/mpirun -np 2 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
[cluster1:54054] *** Process received signal ***
[cluster1:54054] Signal: Bus error (10)
[cluster1:54054] Signal code: Invalid address alignment (1)
[cluster1:54054] Failing at address: 0xad7393
[cluster1:54054] [ 0] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xed20) [0xfffff80102286d20]
[cluster1:54054] [ 1] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xf048) [0xfffff80102287048]
[cluster1:54054] [ 2] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xf9e4) [0xfffff801022879e4]
[cluster1:54054] [ 3] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_coll_tuned.so(+0x5330) [0xfffff8010343d330]
[cluster1:54054] [ 4] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_coll_tuned.so(+0x17d8c) [0xfffff8010344fd8c]
[cluster1:54054] [ 5] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_coll_tuned.so(+0x5e8c) [0xfffff8010343de8c]
[cluster1:54054] [ 6] /usr/mpi/gcc/openmpi-1.4.4/lib/libmpi.so.0(MPI_Barrier+0x164) [0xfffff801000bffe8]
[cluster1:54054] [ 7] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster1:54054] [ 8] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100c49240]
[cluster1:54054] [ 9] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster1:54054] *** End of error message ***
[cluster2:04708] *** Process received signal ***
[cluster2:04708] Signal: Bus error (10)
[cluster2:04708] Signal code: Invalid address alignment (1)
[cluster2:04708] Failing at address: 0xad7393
[cluster2:04708] [ 0] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xed20) [0xfffff80102286d20]
[cluster2:04708] [ 1] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xf048) [0xfffff80102287048]
[cluster2:04708] [ 2] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_pml_ob1.so(+0xf9e4) [0xfffff801022879e4]
[cluster2:04708] [ 3] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_coll_tuned.so(+0x5330) [0xfffff8010343d330]
[cluster2:04708] [ 4] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_coll_tuned.so(+0x17d8c) [0xfffff8010344fd8c]
[cluster2:04708] [ 5] /usr/mpi/gcc/openmpi-1.4.4/lib/openmpi/mca_coll_tuned.so(+0x5e8c) [0xfffff8010343de8c]
[cluster2:04708] [ 6] /usr/mpi/gcc/openmpi-1.4.4/lib/libmpi.so.0(MPI_Barrier+0x164) [0xfffff801000bffe8]
[cluster2:04708] [ 7] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster2:04708] [ 8] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100c49240]
[cluster2:04708] [ 9] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster2:04708] *** End of error message ***
---
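To illustrate what the "Invalid address alignment" report means for the failing address 0xad7393, here is a minimal sketch in C. It is not taken from the Open MPI sources: the helper names are hypothetical, and it assumes the field written at pml_ob1_sendreq.c:551 (hdr_ctx) is 16 bits wide, which I have not verified against the 1.4.4 headers. It only shows how to check an address for alignment and how a memcpy-based store avoids a single halfword store to an odd address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uintptr_t addr, size_t align)
{
    return (addr & (align - 1)) == 0;
}

/* Hypothetical helper: store a 16-bit value at a possibly unaligned address.
 * memcpy copies byte-wise, so the compiler is not required to emit a single
 * halfword store, which is the kind of access that traps on SPARC when the
 * address is odd. */
static void store_u16_unaligned(void *dst, uint16_t value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}

/* Hypothetical helper: the matching load from a possibly unaligned address. */
static uint16_t load_u16_unaligned(const void *src)
{
    uint16_t value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}
```

With these helpers, is_aligned((uintptr_t)0xad7393, 2) is false, so under the 16-bit assumption a direct store through a uint16_t pointer at that address would be misaligned. Whether the real fix belongs in the header layout or in the buffer the openib BTL hands back is a separate question that this sketch does not answer.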
I hope this helps. If you need anything else, please ask.
I'll provide any further information as fast as I can.
Many thanks for your time!
Best regards,
Lukas
PS: You can find the whole discussion here:
http://www.open-mpi.org/community/lists/devel/2011/11/subject.php