All,
I am upgrading Open MPI from 1.4.1 to 1.4.2 on two clusters, one with InfiniBand (IB) and one without.
I have no problem on the Gigabit Ethernet (GE) cluster without IB, which requires no special configure
options for IB. 1.4.2 works perfectly there with the latest Intel and PGI
compilers.
On the IB system, 1.4.1 has worked fine with the following configure line:
./configure CC=icc CXX=icpc F77=ifort FC=ifort --enable-openib-ibcm --with-openib --prefix=/share/apps/openmpi-intel/1.4.1 --with-tm=/share/apps/pbs/10.1.0.91350
I have now built 1.4.2 with the almost identical line:
./configure CC=icc CXX=icpc F77=ifort FC=ifort --enable-openib-ibcm --with-openib --prefix=/share/apps/openmpi-intel/1.4.2 --with-tm=/share/apps/pbs/default
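Apart from the prefix, the only difference between the two configure lines is the --with-tm path, so one quick check is whether /share/apps/pbs/default resolves to the same PBS tree as /share/apps/pbs/10.1.0.91350. A minimal sketch, assuming "default" is a symlink (not confirmed here) and that GNU readlink is available; same_install is a hypothetical helper name used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) only if both paths resolve,
# after following symlinks, to the same real location.
same_install() {
    a=$(readlink -f "$1") || return 2   # path could not be resolved
    b=$(readlink -f "$2") || return 2
    [ "$a" = "$b" ]
}

# Paths taken from the two configure lines above.
if same_install /share/apps/pbs/default /share/apps/pbs/10.1.0.91350; then
    echo "both builds point at the same PBS tree"
else
    echo "the 1.4.2 build may be using a different PBS tree (or a path is missing)"
fi
```

If the two resolve differently, the 1.4.2 build was configured against a different TM library than the working 1.4.1 build, which would be worth ruling out first.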
When I run a basic MPI test program with:
/share/apps/openmpi-intel/1.4.2/bin/mpirun -np 16 -machinefile $PBS_NODEFILE ./hello_mpi.exe
which defaults to using the IB switch, or with:
/share/apps/openmpi-intel/1.4.2/bin/mpirun -mca btl tcp,self -np 16 -machinefile $PBS_NODEFILE ./hello_mpi.exe
which forces the use of GE, I get the same error:
[compute-0-3:22515] *** Process received signal ***
[compute-0-3:22515] Signal: Segmentation fault (11)
[compute-0-3:22515] Signal code: Address not mapped (1)
[compute-0-3:22515] Failing at address: 0x3f
[compute-0-3:22515] [ 0] /lib64/libpthread.so.0 [0x3639e0e7c0]
[compute-0-3:22515] [ 1] /share/apps/openmpi-intel/1.4.2/lib/openmpi/mca_plm_tm.so(discui_+0x84) [0x2b7b546dd3d0]
[compute-0-3:22515] [ 2] /share/apps/openmpi-intel/1.4.2/lib/openmpi/mca_plm_tm.so(diswsi+0xc3) [0x2b7b546da9e3]
[compute-0-3:22515] [ 3] /share/apps/openmpi-intel/1.4.2/lib/openmpi/mca_plm_tm.so [0x2b7b546d868c]
[compute-0-3:22515] [ 4] /share/apps/openmpi-intel/1.4.2/lib/openmpi/mca_plm_tm.so(tm_init+0x1fe) [0x2b7b546d8978]
[compute-0-3:22515] [ 5] /share/apps/openmpi-intel/1.4.2/lib/openmpi/mca_plm_tm.so [0x2b7b546d791c]
[compute-0-3:22515] [ 6] /share/apps/openmpi-intel/1.4.2/bin/mpirun [0x404c27]
[compute-0-3:22515] [ 7] /share/apps/openmpi-intel/1.4.2/bin/mpirun [0x403e38]
[compute-0-3:22515] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x363961d994]
[compute-0-3:22515] [ 9] /share/apps/openmpi-intel/1.4.2/bin/mpirun [0x403d69]
[compute-0-3:22515] *** End of error message ***
/var/spool/PBS/mom_priv/jobs/9909.bob.csi.cuny.edu.SC: line 42: 22515 Segmentation fault /share/apps/openmpi-intel/1.4.2/bin/mpirun -mca btl tcp,self -np 16 -machinefile $PBS_NODEFILE ./hello_mpi.exe
When compiling with the PGI compiler suite I get the same result,
although the traceback gives less detail. I have seen postings suggesting
that disabling the memory manager might work around
this problem, but that would cost performance on this IB
system.
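The backtrace above ends in tm_init inside mca_plm_tm.so, that is, in the PBS/TM launcher rather than in a BTL, which would be consistent with the crash appearing both with and without "-mca btl tcp,self". One cheap experiment, offered as a sketch rather than a fix, is to run the same test once with the rsh/ssh launcher selected instead of TM. This assumes passwordless ssh or rsh works between the compute nodes, which has not been verified here:

```shell
#!/bin/sh
# Sketch of a one-off job-script fragment: same hello test, but with the
# rsh/ssh launcher (plm rsh) selected instead of the TM launcher.
MPIRUN=/share/apps/openmpi-intel/1.4.2/bin/mpirun

if [ -x "$MPIRUN" ]; then
    "$MPIRUN" -mca plm rsh -np 16 -machinefile "$PBS_NODEFILE" ./hello_mpi.exe
else
    echo "mpirun not found at $MPIRUN"
fi
```

If this run succeeds, the problem is likely confined to how the tm component was built against PBS; if it still segfaults, the launcher is probably not the culprit.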
Have others seen this? Suggestions?
Thanks,
Richard Walsh
CUNY HPC Center