Thanks Peter. Blocking the shared memory layer did the trick for our program too.
For the record, we also have SGI Propack 6 installed (sgi-propack-release-6-sgi600r3).
Is the on-node shared memory support completely blocked? What if the MPI process calls a procedure that uses OpenMP threads, for instance?
Geraldo,
The previous message you saw was for our Altix ICE system. Since we started seeing these errors after upgrading to SGI Propack 6, I wonder if there's a bug somewhere in the Propack software or an incompatibility between Open MPI and OFED 1.3 (we had no problems under OFED 1.2). A workaround I stumbled across is to turn off the sm component:
mpirun --mca btl ^sm . . .
That seems to allow our application to run, although I guess at the expense of losing on-node shared memory support.
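If editing every mpirun line is inconvenient, the same setting can, as far as I know, also be supplied through the environment, since Open MPI reads MCA parameters from variables named OMPI_MCA_<param>. A minimal sketch; the process count and the commented-out mpirun line are placeholders, not our actual job:

```shell
# Equivalent to passing "--mca btl ^sm" on the mpirun command line.
# The caret is quoted because some shells treat ^ specially.
export OMPI_MCA_btl='^sm'

# Confirm what launched processes will inherit.
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"

# Hypothetical launch line (process count is an assumption):
# mpirun -np 16 ./dsimpletest

# To list the sm component's parameters on an installed system:
# ompi_info --param btl sm
```

Inside a PBS job script the export would go before the mpirun call. The caret only excludes the sm component, so the remaining BTLs are still selected automatically.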
Peter
Geraldo Veiga wrote:
Hi to all,
I am reusing the subject of a recent message I found in the archives of this mailing list:
http://www.open-mpi.org/community/lists/users/2008/10/7025.php
There was no follow-up on that one, but I will add this similar report in case a list member can give us an idea of how to correct it, or of whose bug this could be.
My application behaves as expected when I run it on a single host with multiple MPI processes on our SGI Altix ICE 8200 cluster with InfiniBand. When I try the same with multiple hosts, using the PBS batch system, the program terminates with a bus error:
-------
[r1i0n9:09192] *** Process received signal ***
[r1i0n9:09192] Signal: Bus error (7)
[r1i0n9:09192] Signal code: (2)
[r1i0n9:09192] Failing at address: 0x2b67ca0c8c20
[r1i0n9:09192] [ 0] /lib64/libpthread.so.0 [0x2b67bfdb1c00]
[r1i0n9:09192] [ 1] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(ompi_free_list_grow+0x14a) [0x2b67bf499b38]
[r1i0n9:09192] [ 2] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) [0x2b67c3a43e15]
[r1i0n9:09192] [ 3] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) [0x2b67c34e9041]
[r1i0n9:09192] [ 4] /sw/openmpi_intel/1.2.8/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x540) [0x2b67c34e17ec]
[r1i0n9:09192] [ 5] /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(PMPI_Isend+0x63) [0x2b67bf4dcd1f]
[r1i0n9:09192] [ 6] /sw/openmpi_intel/1.2.8/lib/libmpi_f77.so.0(pmpi_isend+0x8f) [0x2b67bf36a03f]
[r1i0n9:09192] [ 7] dsimpletest(dmumps_comm_buffer_mp_dmumps_519_+0x449) [0x53e19b]
[r1i0n9:09192] [ 8] dsimpletest(dmumps_load_mp_dmumps_512_+0x20b) [0x54fda1]
[r1i0n9:09192] [ 9] dsimpletest(dmumps_251_+0x4995) [0x4d273b]
[r1i0n9:09192] [10] dsimpletest(dmumps_244_+0x808) [0x484e38]
[r1i0n9:09192] [11] dsimpletest(dmumps_142_+0x8717) [0x4bf5eb]
[r1i0n9:09192] [12] dsimpletest(dmumps_+0x1554) [0x43a720]
[r1i0n9:09192] [13] dsimpletest(MAIN__+0x50b) [0x41e4c3]
[r1i0n9:09192] [14] dsimpletest(main+0x3c) [0x683d4c]
[r1i0n9:09192] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b67bfeda184]
[r1i0n9:09192] [16] dsimpletest(dtrmv_+0xa1) [0x41df29]
[r1i0n9:09192] *** End of error message ***
----
Most of the software infrastructure is provided by SGI ProPack. Any hints on where to look further into this bug?
Thanks in advance.
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Peter Cebull