Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Bus Error in ompi_free_list_grow
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-11-14 13:23:00


If this is the problem, that's good -- it just means that we need
better error detection in the case where we run out of memory, etc.
Stay tuned to that thread to see what happens.

On Nov 14, 2008, at 1:14 PM, Peter Cebull wrote:

> Jeff Squyres wrote:
>> Could this issue actually be related to:
>>
>> http://www.open-mpi.org/community/lists/devel/2008/11/4882.php
>>
>> (read through the thread to get to the error handling stuff)
> You might be right that this issue is the problem. Our system has
> diskless nodes, so /tmp uses a ramdisk. It was initially configured
> so that /tmp could use up to 8 GB of the 16 GB of memory on each
> node. We didn't notice until recently that something in the upgrade
> we made to the system dropped the size of /tmp to 48 MB, so maybe
> that's the cause of the problem. We've increased the size of /tmp
> again in the compute node image, but I'll have to wait until we get
> a chance to push out the new image before I can tell if that will
> fix our problem.
>
> Thanks!
> Peter
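
If the sm BTL's backing file in /tmp is indeed the culprit, a quick
sanity check is to compare the free space in /tmp on a compute node
with what the job needs and, until /tmp can be enlarged, to point
Open MPI's session directory at a larger filesystem. A rough sketch
(the /scratch path and the -np count are placeholders, and it assumes
this Open MPI release honors TMPDIR when picking its session
directory):

  # How much space does the ramdisk-backed /tmp really have?
  df -h /tmp

  # Hypothetical workaround: put the session directory (and thus the
  # sm backing file) on a larger filesystem instead of /tmp.
  export TMPDIR=/scratch/$USER/ompi-session
  mkdir -p "$TMPDIR"
  mpirun -np 8 ./dsimpletest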
>>
>>
>>
>> On Nov 14, 2008, at 7:41 AM, Geraldo Veiga wrote:
>>
>>> Thanks, Peter. Blocking the shared memory layer did the trick for
>>> our program too.
>>>
>>> For the record, we also have SGI Propack 6 installed
>>> (sgi-propack-release-6-sgi600r3).
>>>
>>> Is the on-node shared memory support completely blocked? What if
>>> the MPI process calls a procedure that uses OpenMP threads, for
>>> instance?
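
For what it's worth, excluding the sm BTL should only change how MPI
messages move between ranks on the same node (they fall back to
another transport such as openib or tcp); shared memory used inside a
single rank, for example by OpenMP threads, is a separate mechanism.
A hybrid run would still look something like this sketch (the thread
and rank counts are placeholders):

  # OpenMP threading inside each MPI rank is unaffected by the BTL
  # choice; only MPI's on-node message path changes when sm is
  # excluded.
  export OMP_NUM_THREADS=4
  mpirun --mca btl ^sm -np 2 ./dsimpletest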
>>>
>>>
>>> On Thu, Nov 13, 2008 at 1:44 PM, Peter Cebull
>>> <peter.cebull_at_[hidden]> wrote:
>>> Geraldo,
>>>
>>> The previous message you saw was for our Altix ICE system. Since
>>> we started seeing these errors after upgrading to SGI Propack 6, I
>>> wonder if there's a bug somewhere in the Propack software or an
>>> incompatibility between Open MPI and OFED 1.3 (we had no problems
>>> under OFED 1.2). A workaround I stumbled across is to turn off the
>>> sm component:
>>>
>>> mpirun --mca btl ^sm . . .
>>>
>>> That seems to allow our application to run, although I guess at
>>> the expense of losing on-node shared memory support.
>>>
>>> Peter
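
For reference, the same effect can be had by listing the transports
to use explicitly instead of excluding sm; on an InfiniBand cluster
that would typically be the openib and self BTLs (adjust the list to
whatever ompi_info reports, and treat the -np count as a
placeholder):

  # Equivalent in spirit to "--mca btl ^sm": use only the listed
  # BTLs, leaving sm out.
  mpirun --mca btl openib,self -np 8 ./dsimpletest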
>>>
>>> Geraldo Veiga wrote:
>>> Hi to all,
>>>
>>> I am using the same subject as a recent message I found in the
>>> archives of this mailing list:
>>>
>>> http://www.open-mpi.org/community/lists/users/2008/10/7025.php
>>>
>>> There was no follow-up on that one, but I will add this similar
>>> report in case a list member can give us an idea of how to correct
>>> it, or whose bug this could be.
>>>
>>> My application behaves as expected when I run it on a single host
>>> with multiple MPI processes on our SGI Altix ICE 8200 cluster with
>>> InfiniBand. When I try the same run across multiple hosts, using
>>> the PBS batch system, the program terminates with a bus error:
>>>
>>> -------
>>> [r1i0n9:09192] *** Process received signal ***
>>> [r1i0n9:09192] Signal: Bus error (7)
>>> [r1i0n9:09192] Signal code: (2)
>>> [r1i0n9:09192] Failing at address: 0x2b67ca0c8c20
>>> [r1i0n9:09192] [ 0] /lib64/libpthread.so.0 [0x2b67bfdb1c00]
>>> [r1i0n9:09192] [ 1] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(ompi_free_list_grow+0x14a) [0x2b67bf499b38]
>>> [r1i0n9:09192] [ 2] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) [0x2b67c3a43e15]
>>> [r1i0n9:09192] [ 3] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) [0x2b67c34e9041]
>>> [r1i0n9:09192] [ 4] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x540) [0x2b67c34e17ec]
>>> [r1i0n9:09192] [ 5] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(PMPI_Isend+0x63) [0x2b67bf4dcd1f]
>>> [r1i0n9:09192] [ 6] /sw/openmpi_intel/1.2.8/lib/libmpi_f77.so.0(pmpi_isend+0x8f) [0x2b67bf36a03f]
>>> [r1i0n9:09192] [ 7] dsimpletest(dmumps_comm_buffer_mp_dmumps_519_+0x449) [0x53e19b]
>>> [r1i0n9:09192] [ 8] dsimpletest(dmumps_load_mp_dmumps_512_+0x20b) [0x54fda1]
>>> [r1i0n9:09192] [ 9] dsimpletest(dmumps_251_+0x4995) [0x4d273b]
>>> [r1i0n9:09192] [10] dsimpletest(dmumps_244_+0x808) [0x484e38]
>>> [r1i0n9:09192] [11] dsimpletest(dmumps_142_+0x8717) [0x4bf5eb]
>>> [r1i0n9:09192] [12] dsimpletest(dmumps_+0x1554) [0x43a720]
>>> [r1i0n9:09192] [13] dsimpletest(MAIN__+0x50b) [0x41e4c3]
>>> [r1i0n9:09192] [14] dsimpletest(main+0x3c) [0x683d4c]
>>> [r1i0n9:09192] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b67bfeda184]
>>> [r1i0n9:09192] [16] dsimpletest(dtrmv_+0xa1) [0x41df29]
>>> [r1i0n9:09192] *** End of error message ***
>>> -------
>>>
>>> Most of the software infrastructure is provided by the SGI
>>> Propack. Any hints on where to look further into this bug?
>>>
>>> Thanks in advance.
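
One way to gather more data before blaming any particular component
is to turn up the BTL framework's verbosity and compare a working
single-host run against a failing multi-host run. A sketch (the
verbosity level and -np count are placeholders):

  # Which BTL components were built into this install?
  ompi_info | grep btl

  # Log BTL selection and setup while reproducing the failure.
  mpirun --mca btl_base_verbose 50 -np 8 ./dsimpletest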
>>>
>>> --
>>> Geraldo Veiga <gveiga_at_[hidden]>
>>>
>>> --
>>> Peter Cebull
>>>
>>>
>>>
>>> --
>>> Geraldo Veiga <gveiga_at_[hidden]>
>>
>>
>
>
> --
> Peter Cebull
> Idaho National Laboratory
> P.O. Box 1625, MS3605
> Idaho Falls, ID 83415
> Phone: 208-526-1909
> Email: Peter.Cebull_at_[hidden]
>

-- 
Jeff Squyres
Cisco Systems