Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bus Error in ompi_free_list_grow
From: Peter Cebull (peter.cebull_at_[hidden])
Date: 2008-11-13 10:44:33


Geraldo,

The previous message you saw was for our Altix ICE system. Since we
started seeing these errors after upgrading to SGI Propack 6, I wonder
if there's a bug somewhere in the Propack software or an incompatibility
between Open MPI and OFED 1.3 (we had no problems under OFED 1.2). A
workaround I stumbled across is to turn off the sm component:

mpirun --mca btl ^sm ...

That seems to allow our application to run, although I guess at the
expense of losing on-node shared memory support.
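
In case it helps anyone reproduce this, a minimal sketch of what that looks like in a PBS batch script is below. The resource request, walltime, and executable name are placeholders rather than our actual settings, so adjust them for your own site:

#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR

# Exclude the shared-memory BTL; Open MPI then uses the remaining
# BTLs (e.g. openib for InfiniBand traffic, self for loopback).
mpirun --mca btl ^sm ./dsimpletest

If you would rather not edit every mpirun command, the same exclusion can be made persistent by exporting OMPI_MCA_btl=^sm in the job environment or by adding the line "btl = ^sm" to $HOME/.openmpi/mca-params.conf.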

Peter

Geraldo Veiga wrote:
> Hi to all,
>
> I am reusing the subject line of a recent message I found in the
> archives of this mailing list:
>
> http://www.open-mpi.org/community/lists/users/2008/10/7025.php
>
> There was no follow-up on that one, but I will add this similar report
> in case a list member can give us an idea of how to correct it, or
> whose bug this could be.
>
> My application behaves as expected when I run it on a single host with
> multiple MPI processes on our SGI Altix ICE 8200 cluster with
> InfiniBand. When I try the same across multiple hosts, using the PBS
> batch system, the program terminates with a bus error:
>
> -------
> [r1i0n9:09192] *** Process received signal ***
> [r1i0n9:09192] Signal: Bus error (7)
> [r1i0n9:09192] Signal code: (2)
> [r1i0n9:09192] Failing at address: 0x2b67ca0c8c20
> [r1i0n9:09192] [ 0] /lib64/libpthread.so.0 [0x2b67bfdb1c00]
> [r1i0n9:09192] [ 1] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(ompi_free_list_grow+0x14a) [0x2b67bf499b38]
> [r1i0n9:09192] [ 2] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) [0x2b67c3a43e15]
> [r1i0n9:09192] [ 3] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) [0x2b67c34e9041]
> [r1i0n9:09192] [ 4] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x540) [0x2b67c34e17ec]
> [r1i0n9:09192] [ 5] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(PMPI_Isend+0x63) [0x2b67bf4dcd1f]
> [r1i0n9:09192] [ 6] /sw/openmpi_intel/1.2.8/lib/libmpi_f77.so.0(pmpi_isend+0x8f) [0x2b67bf36a03f]
> [r1i0n9:09192] [ 7] dsimpletest(dmumps_comm_buffer_mp_dmumps_519_+0x449) [0x53e19b]
> [r1i0n9:09192] [ 8] dsimpletest(dmumps_load_mp_dmumps_512_+0x20b) [0x54fda1]
> [r1i0n9:09192] [ 9] dsimpletest(dmumps_251_+0x4995) [0x4d273b]
> [r1i0n9:09192] [10] dsimpletest(dmumps_244_+0x808) [0x484e38]
> [r1i0n9:09192] [11] dsimpletest(dmumps_142_+0x8717) [0x4bf5eb]
> [r1i0n9:09192] [12] dsimpletest(dmumps_+0x1554) [0x43a720]
> [r1i0n9:09192] [13] dsimpletest(MAIN__+0x50b) [0x41e4c3]
> [r1i0n9:09192] [14] dsimpletest(main+0x3c) [0x683d4c]
> [r1i0n9:09192] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b67bfeda184]
> [r1i0n9:09192] [16] dsimpletest(dtrmv_+0xa1) [0x41df29]
> [r1i0n9:09192] *** End of error message ***
> ----
>
> Most of the software infrastructure is provided by the Intel propack.
> Any hints on where to look further into this bug?
>
> Thanks in advance.
>
> --
> Geraldo Veiga <gveiga_at_[hidden]>

-- 
Peter Cebull