Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bus Error in ompi_free_list_grow
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-11-14 12:21:21


Could this issue actually be related to:

     http://www.open-mpi.org/community/lists/devel/2008/11/4882.php

(read through the thread to get to the error handling stuff)

On Nov 14, 2008, at 7:41 AM, Geraldo Veiga wrote:

> Thanks, Peter. Blocking the shared memory layer did the trick for
> our program too.
>
> For the record, we also have SGI Propack 6 installed
> (sgi-propack-release-6-sgi600r3).
>
> Is the on-node shared memory support completely blocked? What if
> the MPI process calls a procedure that uses OpenMP threads, for
> instance?
>
>
> On Thu, Nov 13, 2008 at 1:44 PM, Peter Cebull <peter.cebull_at_[hidden]>
> wrote:
> Geraldo,
>
> The previous message you saw was for our Altix ICE system. Since we
> started seeing these errors after upgrading to SGI Propack 6, I
> wonder if there's a bug somewhere in the Propack software or an
> incompatibility between Open MPI and OFED 1.3 (we had no problems
> under OFED 1.2). A workaround I stumbled across is to turn off the
> sm component:
>
> mpirun --mca btl ^sm . . .
>
> That seems to allow our application to run, although I guess at the
> expense of losing on-node shared memory support.
>
> Peter
>
> Geraldo Veiga wrote:
> Hi to all,
>
> I am using the same subject as a recent message I found in the
> archives of this mailing list:
>
> http://www.open-mpi.org/community/lists/users/2008/10/7025.php
>
> There was no follow-up on that one, but I will add this similar
> report in case a list member can give us an idea of how to correct
> it, or whose bug this could be.
>
> My application behaves as expected when I run it on a single host
> with multiple MPI processes on our SGI Altix ICE 8200 cluster with
> InfiniBand. When I try the same run across multiple hosts, using the
> PBS batch system, the program terminates with a bus error:
>
> -------
> [r1i0n9:09192] *** Process received signal ***
> [r1i0n9:09192] Signal: Bus error (7)
> [r1i0n9:09192] Signal code: (2)
> [r1i0n9:09192] Failing at address: 0x2b67ca0c8c20
> [r1i0n9:09192] [ 0] /lib64/libpthread.so.0 [0x2b67bfdb1c00]
> [r1i0n9:09192] [ 1] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(ompi_free_list_grow+0x14a) [0x2b67bf499b38]
> [r1i0n9:09192] [ 2] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) [0x2b67c3a43e15]
> [r1i0n9:09192] [ 3] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) [0x2b67c34e9041]
> [r1i0n9:09192] [ 4] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x540) [0x2b67c34e17ec]
> [r1i0n9:09192] [ 5] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(PMPI_Isend+0x63) [0x2b67bf4dcd1f]
> [r1i0n9:09192] [ 6] /sw/openmpi_intel/1.2.8/lib/libmpi_f77.so.0(pmpi_isend+0x8f) [0x2b67bf36a03f]
> [r1i0n9:09192] [ 7] dsimpletest(dmumps_comm_buffer_mp_dmumps_519_+0x449) [0x53e19b]
> [r1i0n9:09192] [ 8] dsimpletest(dmumps_load_mp_dmumps_512_+0x20b) [0x54fda1]
> [r1i0n9:09192] [ 9] dsimpletest(dmumps_251_+0x4995) [0x4d273b]
> [r1i0n9:09192] [10] dsimpletest(dmumps_244_+0x808) [0x484e38]
> [r1i0n9:09192] [11] dsimpletest(dmumps_142_+0x8717) [0x4bf5eb]
> [r1i0n9:09192] [12] dsimpletest(dmumps_+0x1554) [0x43a720]
> [r1i0n9:09192] [13] dsimpletest(MAIN__+0x50b) [0x41e4c3]
> [r1i0n9:09192] [14] dsimpletest(main+0x3c) [0x683d4c]
> [r1i0n9:09192] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b67bfeda184]
> [r1i0n9:09192] [16] dsimpletest(dtrmv_+0xa1) [0x41df29]
> [r1i0n9:09192] *** End of error message ***
> ----
>
> Most of the software infrastructure is provided by the Intel
> Propack. Any hints on where to look further into this bug?
>
> Thanks in advance.
>
> --
> Geraldo Veiga <gveiga_at_[hidden]>
>
> --
> Peter Cebull
>
> --
> Geraldo Veiga <gveiga_at_[hidden]>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems