On Fri, Jun 1, 2012 at 5:00 AM, Jeff Squyres <jsquyres@cisco.com> wrote:
Try running:

which mpirun
ssh cl2n022 which mpirun
ssh cl2n010 which mpirun

and

ldd your_mpi_executable
ssh cl2n022 ldd your_mpi_executable
ssh cl2n010 ldd your_mpi_executable

Compare the results and ensure that you're finding the same mpirun on all nodes, and the same libmpi.so on all nodes.  There may well be another Open MPI installed in some non-default location of which you're unaware.
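A minimal sketch of how that comparison might be automated across every node of a job. It assumes passwordless ssh between nodes and Torque's $PBS_NODEFILE; the function names report_mismatch and collect_mpirun are hypothetical. Note that a non-interactive ssh shell may not have the same modules loaded as the job script, so an empty or different path can reflect the remote login environment rather than a second installation.

```shell
#!/bin/sh
# Hypothetical helper: reads "host<TAB>path" lines on stdin and reports
# whether every host resolved the same path. Prints "consistent" and
# returns 0 if so; otherwise lists each distinct path with its hosts
# and returns 1.
report_mismatch() {
    awk '{ seen[$2] = seen[$2] " " $1 }
         END {
             n = 0
             for (p in seen) n++
             if (n <= 1) { print "consistent"; exit 0 }
             for (p in seen) print p ":" seen[p]
             exit 1
         }'
}

# Hypothetical helper: emits "host<TAB>path-to-mpirun" for each unique
# node in the job. Assumes $PBS_NODEFILE is set (Torque) and that ssh
# works without a password. "ssh -n" keeps ssh from swallowing the
# node list on stdin.
collect_mpirun() {
    sort -u "$PBS_NODEFILE" | while read -r node; do
        printf '%s\t%s\n' "$node" "$(ssh -n "$node" which mpirun 2>/dev/null)"
    done
}
```

Inside a job script this could be used as "collect_mpirun | report_mismatch". The same pattern should work for libmpi.so by replacing "which mpirun" with an ldd call on the executable and filtering for the libmpi line.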

I'll try that, Jeff (results given below). However, I suspect there is something goofy about this (brand new) cluster itself: among the countless jobs that failed, one ran without error, and all I did was rearrange the echo and which commands. We've also observed some peculiar behaviour on this cluster with Intel MPI that seemed to be associated with the number of tasks requested. After more experimentation, the Open MPI version of the program also seems to be sensitive to the number of tasks (e.g., it works with 48 and fails with 64).

Thanks for the feedback, Jeff, but I think the ball is firmly in my court.



I ran the following PBS script with "qsub -l procs=128 job.pbs". Environment variables are set using the Environment Modules package.

echo $HOSTNAME
which mpiexec
module load library/openmpi/1.6-intel

which mpiexec
echo $PATH
echo $LD_LIBRARY_PATH
ldd test-ompi16
mpiexec --prefix /lustre/jasper/software/openmpi/openmpi-1.6-intel ./test-ompi16

Standard output gave

cl2n011

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin/mpiexec

/lustre/jasper/software/openmpi/openmpi-1.6-intel/bin:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/bin/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64:/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin

/lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64

    linux-vdso.so.1 =>  (0x00007fffb5358000)
    libmpi.so.1 => /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 (0x00002b3968d1d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x000000329ce00000)
    libimf.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libimf.so (0x00002b3969137000)
    libm.so.6 => /lib64/libm.so.6 (0x000000329d200000)
    librt.so.1 => /lib64/librt.so.1 (0x000000329da00000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x00000032a6400000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00000032a8400000)
    libsvml.so => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libsvml.so (0x00002b3969504000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000032a4c00000)
    libintlc.so.5 => /lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64/libintlc.so.5 (0x00002b3969c77000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000000329d600000)
    libc.so.6 => /lib64/libc.so.6 (0x000000329ca00000)
    /lib64/ld-linux-x86-64.so.2 (0x000000329c200000)


Standard error gave

which: no mpiexec in (/home/esumbar/local/bin:/home/esumbar/bin:/usr/kerberos/bin:/bin:/usr/bin:/opt/sgi/sgimc/bin:/usr/local/torque/sbin:/usr/local/torque/bin)

[cl2n005:05142] *** Process received signal ***
[cl2n005:05142] Signal: Segmentation fault (11)
[cl2n005:05142] Signal code: Address not mapped (1)
[cl2n005:05142] Failing at address: 0x10
[cl2n005:05142] [ 0] /lib64/libpthread.so.0 [0x373180ebe0]
[cl2n005:05142] [ 1] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2aff9aad5113]
[cl2n005:05142] [ 2] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59) [0x2aff9aad78a9]
[cl2n005:05142] [ 3] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1 [0x2aff9aad7596]
[cl2n005:05142] [ 4] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_grow+0x89) [0x2aff9aa0fa59]
[cl2n005:05142] [ 5] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_free_list_init_ex+0x9c) [0x2aff9aa0fd8c]
[cl2n005:05142] [ 6] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so [0x2aff9e94561c]
[cl2n005:05142] [ 7] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_btl_base_select+0x130) [0x2aff9aa57930]
[cl2n005:05142] [ 8] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0xe) [0x2aff9e52bc1e]
[cl2n005:05142] [ 9] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_bml_base_init+0x72) [0x2aff9aa570b2]
[cl2n005:05142] [10] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_pml_ob1.so [0x2aff9e1107e9]
[cl2n005:05142] [11] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(mca_pml_base_select+0x43e) [0x2aff9aa6592e]
[cl2n005:05142] [12] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(ompi_mpi_init+0x782) [0x2aff9aa276a2]
[cl2n005:05142] [13] /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(MPI_Init+0xf4) [0x2aff9aa3f884]
[cl2n005:05142] [14] ./test-ompi16(main+0x4c) [0x400b5c]
[cl2n005:05142] [15] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3730c1d994]
[cl2n005:05142] [16] ./test-ompi16 [0x400a59]
[cl2n005:05142] *** End of error message ***
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n006:32362] [[58962,0],5] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/nidmap.c at line 776
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ess_tm_module.c at line 310
[cl2n003:04157] [[58962,0],8] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file base/odls_base_default_fns.c at line 2342
--------------------------------------------------------------------------
mpiexec noticed that process rank 77 with PID 5142 on node cl2n005 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


--
Edmund Sumbar
University of Alberta
+1 780 492 9360