I’ve dug a little deeper and think the problem has something to do with the 10 MB /tmp filesystem.

 

[bloscel@k1n11 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
compute_x86_64         32G  1.1G   31G   4% /
tmpfs                  32G     0   32G   0% /dev/shm
tmpfs                  10M   80K   10M   1% /tmp
tmpfs                  10M     0   10M   0% /var/tmp
/dev/lb                53T  109G   53T   1% /gpfs/lb
/dev/sb               3.3T   38G  3.3T   2% /gpfs/sb
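To make the free-space check on /tmp repeatable on each compute node, here is a minimal sketch. It relies on the POSIX `df -Pk` output format, where the fourth column is the available space in KB. The function names `avail_kb` and `check_space` and the 64 MB threshold are hypothetical choices for illustration, not values taken from Open MPI.

```shell
#!/bin/sh
# Extract the "Available" column (in KB) from `df -Pk <dir>` output on stdin.
# With -P, each filesystem is reported on a single line, so line 2 is the data.
avail_kb() {
    awk 'NR == 2 { print $4 }'
}

# Report whether the filesystem holding a directory has at least min_kb free.
check_space() {
    dir=$1
    min_kb=$2
    kb=$(df -Pk "$dir" 2>/dev/null | avail_kb)
    if [ -z "$kb" ]; then
        echo "$dir: could not determine free space"
        return 2
    fi
    if [ "$kb" -lt "$min_kb" ]; then
        echo "$dir: only ${kb} KB free (want at least ${min_kb} KB)"
        return 1
    fi
    echo "$dir: ${kb} KB free"
}

# Hypothetical threshold of 64 MB; adjust to the job's actual needs.
check_space /tmp 65536 || true
check_space /dev/shm 65536 || true
```

On a node like the one shown above, the /tmp line would be expected to report a shortfall while /dev/shm would not.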

 

[bloscel@k1n11 ~]$ mktemp
/tmp/tmp.L8owhNH1AN

 

[bloscel@k1n11 ~]$ ompi_info -a | grep /dev/shm
               MCA shmem: parameter "shmem_mmap_backing_file_base_dir" (current value: </dev/shm>, data source: default value)

 

[bloscel@k1n11 ~]$ ompi_info -a | grep orte_tmpdir_base
                MCA orte: parameter "orte_tmpdir_base" (current value: <none>, data source: default value)
[bloscel@k1n11 ~]$

 

From: users-bounces@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Blosch, Edwin L
Sent: Wednesday, June 05, 2013 11:14 AM
To: Open MPI Users (users@open-mpi.org)
Subject: EXTERNAL: [OMPI users] How to diagnose bus error with 1.6.4

 

I am running into a bus error that does not happen with MVAPICH, and I am guessing it has something to do with shared-memory communication. Has anyone had a similar experience, or does anyone have insight into what this could be?

 

Thanks

 

[k1n08:12688] mca: base: components_open: Looking for shmem components
[k1n08:12688] mca: base: components_open: opening shmem components
[k1n08:12688] mca: base: components_open: found loaded component mmap
[k1n08:12688] mca: base: components_open: component mmap register function successful
[k1n08:12688] mca: base: components_open: component mmap open function successful
[k1n08:12688] mca: base: components_open: found loaded component posix
[k1n08:12688] mca: base: components_open: component posix has no register function
[k1n08:12688] mca: base: components_open: component posix open function successful
[k1n08:12688] mca: base: components_open: found loaded component sysv
[k1n08:12688] mca: base: components_open: component sysv has no register function
[k1n08:12688] mca: base: components_open: component sysv open function successful
[k1n08:12688] shmem: base: runtime_query: Auto-selecting shmem components
[k1n08:12688] shmem: base: runtime_query: (shmem) Querying component (run-time) [mmap]
[k1n08:12688] shmem: base: runtime_query: (shmem) Query of component [mmap] set priority to 50
[k1n08:12688] shmem: base: runtime_query: (shmem) Querying component (run-time) [posix]
[k1n08:12688] shmem: base: runtime_query: (shmem) Skipping component [posix]. Run-time Query failed to return a module
[k1n08:12688] shmem: base: runtime_query: (shmem) Querying component (run-time) [sysv]
[k1n08:12688] shmem: base: runtime_query: (shmem) Skipping component [sysv]. Run-time Query failed to return a module
[k1n08:12688] shmem: base: runtime_query: (shmem) Selected component [mmap]
[k1n08:12688] mca: base: close: unloading component posix
[k1n08:12688] mca: base: close: unloading component sysv
[k1n08:12688] *** Process received signal ***
[k1n08:12688] Signal: Bus error (7)
[k1n08:12688] Signal code: Non-existant physical address (2)
[k1n08:12688] Failing at address: 0x2ac1e088e030
[k1n08:12688] [ 0] /lib64/libpthread.so.0(+0xf500) [0x2ac1de7c0500]
[k1n08:12688] [ 1] /applocal/cfd/test/bin/test_openmpi(__intel_ssse3_rep_memcpy+0xcdb) [0x1495cab]
[k1n08:12688] [ 2] /applocal/cfd/test/bin/test_openmpi(opal_convertor_pack+0x101) [0x125c111]
[k1n08:12688] [ 3] /applocal/cfd/test/bin/test_openmpi(mca_btl_sm_prepare_src+0xc5) [0x13aab25]
[k1n08:12688] [ 4] /applocal/cfd/test/bin/test_openmpi(mca_pml_ob1_send_request_start_rndv+0x67) [0x12fa9a7]
[k1n08:12688] [ 5] /applocal/cfd/test/bin/test_openmpi(mca_pml_ob1_isend+0x3ab) [0x12ef02b]
[k1n08:12688] [ 6] /applocal/cfd/test/bin/test_openmpi(ompi_coll_tuned_sendrecv_actual+0x94) [0x12d67f4]
[k1n08:12688] [ 7] /applocal/cfd/test/bin/test_openmpi(ompi_coll_tuned_bcast_intra_split_bintree+0x94d) [0x12d45fd]
[k1n08:12688] [ 8] /applocal/cfd/test/bin/test_openmpi(ompi_coll_tuned_bcast_intra_dec_fixed+0x143) [0x12d5dd3]
[k1n08:12688] [ 9] /applocal/cfd/test/bin/test_openmpi(mca_coll_sync_bcast+0x66) [0x12d6aa6]
[k1n08:12688] [10] /applocal/cfd/test/bin/test_openmpi(MPI_Bcast+0x5a) [0x11f95da]
[k1n08:12688] [11] /applocal/cfd/test/bin/test_openmpi(mpi_bcast_f+0x6e) [0x11dca5e]
[k1n08:12688] [12] /applocal/cfd/test/bin/test_openmpi(wpf_calc_mod_mp_wpf_calc_+0x10f0) [0x541be0]
[k1n08:12688] [13] /applocal/cfd/test/bin/test_openmpi(special_init_mod_mp_special_init_geom_+0x3f4) [0x683254]
[k1n08:12688] [14] /applocal/cfd/test/bin/test_openmpi(setup_mod_mp_setup_domains_+0x56b) [0x53effb]
[k1n08:12688] [15] /applocal/cfd/test/bin/test_openmpi(MAIN__+0x1ab7) [0x5e8be7]
[k1n08:12688] [16] /applocal/cfd/test/bin/test_openmpi(main+0x3c) [0x4ff82c]
[k1n08:12688] [17] /lib64/libc.so.6(__libc_start_main+0xfd) [0x2ac1de9eccdd]
[k1n08:12688] [18] /applocal/cfd/test/bin/test_openmpi() [0x4ff729]
[k1n08:12688] *** End of error message ***
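To check whether the shared-memory BTL is on the failing call path, the backtrace can be filtered for symbols beginning with `mca_btl_sm_`. This is a minimal sketch using POSIX `sed`; the function name `sm_frames` is hypothetical, and the sample input is frame [ 3] copied from the trace above.

```shell
#!/bin/sh
# Print the symbol names of backtrace frames on stdin that belong to the
# shared-memory BTL, i.e. symbols starting with "mca_btl_sm_".
sm_frames() {
    sed -n 's/.*(\(mca_btl_sm_[A-Za-z0-9_]*\)+.*/\1/p'
}

# Sample input: frame [ 3] from the trace above.
sm_frames <<'EOF'
[k1n08:12688] [ 3] /applocal/cfd/test/bin/test_openmpi(mca_btl_sm_prepare_src+0xc5) [0x13aab25]
EOF
# prints: mca_btl_sm_prepare_src
```

A match here suggests the crash happens while copying into the shared-memory segment. A common cross-check is to rerun with the sm BTL excluded (`mpirun --mca btl ^sm ...`) and see whether the bus error goes away.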