Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] seg fault with intel compiler
From: Gus Correa (gus_at_[hidden])
Date: 2012-06-01 16:39:31


Hi Edmund

The [Torque/PBS] syntax '-l procs=48' is somewhat troublesome
and may not be understood by the scheduler. [It doesn't
work correctly with Maui, which is what we have here. I have read
people saying it works with pbs_sched and with Moab,
but that's hearsay.]
This issue comes up very often on the Torque mailing list.

Have you tried this alternative syntax instead?

'-l nodes=2:ppn=24'

[I am assuming here that your nodes have 24 cores each,
i.e. 24 'ppn'.]

Then in the script:
mpiexec -np 48 ./your_program
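
Putting the two together, a minimal job script would look
something like this [the job name, walltime, and module name
below are just placeholders, adjust them for your site]:

#!/bin/bash
#PBS -N myjob
#PBS -l nodes=2:ppn=24
#PBS -l walltime=01:00:00
#PBS -j oe

cd $PBS_O_WORKDIR

# load the same compiler/Open MPI modules used to build the code
module load openmpi-1.6-intel

mpiexec -np 48 ./your_program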

Also, in your PBS script you could print
the contents of $PBS_NODEFILE:

cat $PBS_NODEFILE
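
With '-l nodes=2:ppn=24' you should see each of your two node
names listed 24 times. A compact way to check the slot count
per node is:

sort $PBS_NODEFILE | uniq -c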

A simple troubleshooting test is to launch 'hostname'
with mpirun:

mpirun -np 48 hostname
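
If the Torque integration is working, the 48 copies of 'hostname'
should be spread across the nodes in $PBS_NODEFILE. You can
summarize where they actually ran with, for instance:

mpirun -np 48 hostname | sort | uniq -c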

Finally, are you sure that the Open MPI you are using was
compiled with Torque support?
If not, I wonder whether options like '-bynode' would work at all.
Jeff may correct me if I am wrong, but if your
Open MPI lacks Torque support,
you may need to pass $PBS_NODEFILE to mpirun
as your hostfile.
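
You can check whether your Open MPI build has the Torque ('tm')
support components with:

ompi_info | grep tm

If nothing shows up, a workaround is to point mpirun at the
Torque nodefile explicitly, along these lines ['./your_program'
is of course a placeholder]:

mpirun -np 48 -hostfile $PBS_NODEFILE ./your_program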

I hope this helps,
Gus Correa

On 06/01/2012 11:26 AM, Edmund Sumbar wrote:
> On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> It's been a loooong time since I've run under PBS, so I don't
> remember if your script's environment is copied out to the remote
> nodes where your application actually runs.
>
> Can you verify that PATH and LD_LIBRARY_PATH are the same on all
> nodes in your PBS allocation after you module load?
>
>
> I compiled the following program and invoked it with "mpiexec -bynode
> ./test-env" in a PBS script.
>
> #include "mpi.h"
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
>
> int main (int argc, char *argv[])
> {
> int i, rank, size, namelen;
> MPI_Status stat;
>
> MPI_Init (&argc, &argv);
>
> MPI_Comm_size (MPI_COMM_WORLD, &size);
> MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>
> printf("rank: %d: ld_library_path: %s\n", rank,
> getenv("LD_LIBRARY_PATH"));
>
> MPI_Finalize ();
>
> return (0);
> }
>
> I submitted the script with "qsub -l procs=24 job.pbs", and got
>
> rank: 4: ld_library_path:
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64
>
> rank: 3: ld_library_path:
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64
>
> ...more of the same...
>
> When I submitted it with -l procs=48, I got
>
> [cl2n004:11617] *** Process received signal ***
> [cl2n004:11617] Signal: Segmentation fault (11)
> [cl2n004:11617] Signal code: Address not mapped (1)
> [cl2n004:11617] Failing at address: 0x10
> [cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
> [cl2n004:11617] [ 1]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
> [0x2af788a98113]
> [cl2n004:11617] [ 2]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59)
> [0x2af788a9a8a9]
> [cl2n004:11617] [ 3]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1
> [0x2af788a9a596]
> [cl2n004:11617] [ 4]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so
> [0x2af78c916654]
> [cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
> [cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
> [cl2n004:11617] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 4 with PID 11617 on node cl2n004
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> It seems that failures happen for arbitrary reasons. When I added a line
> in the PBS script to print out the node allocation, the procs=24 case
> failed, but then it worked a few seconds later, with the same list of
> allocated nodes. So there's definitely something amiss with the cluster,
> although I wouldn't know where to start investigating. Perhaps there is
> a pre-installed OMPI somewhere that's interfering, but I'm doubtful.
>
> By the way, thanks for all the support.
>
> --
> Edmund Sumbar
> University of Alberta
> +1 780 492 9360
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users