The previous message you saw was for our Altix ICE system. Since we
started seeing these errors after upgrading to SGI Propack 6, I wonder
if there's a bug somewhere in the Propack software or an incompatibility
between Open MPI and OFED 1.3 (we had no problems under OFED 1.2). A
workaround I stumbled across is to turn off the sm component:
mpirun --mca btl ^sm . . .
That seems to allow our application to run, although I guess at the
expense of losing on-node shared memory support.
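For anyone who wants to try the same workaround without editing every mpirun line, here is a minimal sketch. It assumes a standard Open MPI install where MCA parameters can also be given as OMPI_MCA_<name> environment variables. The process count and the ./dsimpletest path in the comments are placeholders, not our actual job setup.

```shell
# Optional: list the BTL components this Open MPI build provides
# (skipped quietly if ompi_info is not on the PATH).
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep "MCA btl" || true
fi

# Same effect as "mpirun --mca btl ^sm": exclude the shared-memory BTL
# for every mpirun started from this shell or batch script.
export OMPI_MCA_btl='^sm'
echo "OMPI_MCA_btl=${OMPI_MCA_btl}"

# Placeholder launch line (process count and binary are assumptions):
#   mpirun -np 16 ./dsimpletest
#
# Alternative: name the BTLs to use instead of excluding one, e.g.
#   mpirun --mca btl openib,self -np 16 ./dsimpletest
```

Either form should leave on-node traffic going through the InfiniBand BTL instead of shared memory, so expect the same on-node performance cost mentioned above.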
Geraldo Veiga wrote:
> Hi to all,
> I am reusing the subject of a recent message I found in the archives
> of this mailing list:
> There was no follow-up on that one, but I will add this similar report
> in case a list member can give us an idea of how to correct it, or
> whose bug this could be.
> My application behaves as expected when I run it on a single host with
> multiple MPI processes on our SGI Altix ICE 8200 cluster with
> InfiniBand. When I try the same with multiple hosts, using the PBS
> batch system, the program terminates with a bus error:
> [r1i0n9:09192] *** Process received signal ***
> [r1i0n9:09192] Signal: Bus error (7)
> [r1i0n9:09192] Signal code: (2)
> [r1i0n9:09192] Failing at address: 0x2b67ca0c8c20
> [r1i0n9:09192] [ 0] /lib64/libpthread.so.0 [0x2b67bfdb1c00]
> [r1i0n9:09192] [ 1]
> [r1i0n9:09192] [ 2]
> [r1i0n9:09192] [ 3]
> [r1i0n9:09192] [ 4]
> [r1i0n9:09192] [ 5]
> /sw/openmpi_intel/1.2.8/lib/libmpi.so.0(PMPI_Isend+0x63) [0x2b67bf4dcd1f]
> [r1i0n9:09192] [ 6]
> [r1i0n9:09192] [ 7]
> dsimpletest(dmumps_comm_buffer_mp_dmumps_519_+0x449) [0x53e19b]
> [r1i0n9:09192] [ 8] dsimpletest(dmumps_load_mp_dmumps_512_+0x20b)
> [r1i0n9:09192] [ 9] dsimpletest(dmumps_251_+0x4995) [0x4d273b]
> [r1i0n9:09192]  dsimpletest(dmumps_244_+0x808) [0x484e38]
> [r1i0n9:09192]  dsimpletest(dmumps_142_+0x8717) [0x4bf5eb]
> [r1i0n9:09192]  dsimpletest(dmumps_+0x1554) [0x43a720]
> [r1i0n9:09192]  dsimpletest(MAIN__+0x50b) [0x41e4c3]
> [r1i0n9:09192]  dsimpletest(main+0x3c) [0x683d4c]
> [r1i0n9:09192]  /lib64/libc.so.6(__libc_start_main+0xf4)
> [r1i0n9:09192]  dsimpletest(dtrmv_+0xa1) [0x41df29]
> [r1i0n9:09192] *** End of error message ***
> Most of the software infrastructure is provided by the Intel propack.
> Any hints of where to look further into this bug?
> Thanks in advance.
> Geraldo Veiga <gveiga_at_[hidden]>