Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] seg fault with intel compiler
From: Gus Correa (gus_at_[hidden])
Date: 2012-06-01 16:39:31


Hi Edmund

The Torque/PBS syntax '-l procs=48' is somewhat troublesome
and may not be understood by the scheduler. [It doesn't
work correctly with Maui, which is what we have here. I have read
that it works with pbs_sched and with Moab, but that's hearsay.]
This issue comes up very often on the Torque mailing list.

Have you tried this alternative syntax instead?

'-l nodes=2:ppn=24'

[I am assuming here that your nodes have 24 cores each, i.e. 24 'ppn'.]

Then in the script:
mpiexec -np 48 ./your_program
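
For example, a minimal job script could look something like this
[just a sketch; I am assuming bash, a one-hour walltime, and that
your executable is called ./your_program]:

#!/bin/bash
#PBS -l nodes=2:ppn=24
#PBS -l walltime=01:00:00

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# 2 nodes x 24 cores = 48 MPI processes
mpiexec -np 48 ./your_program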

Also, in your PBS script you could print out
the contents of $PBS_NODEFILE:

cat $PBS_NODEFILE

A simple troubleshooting test is to launch 'hostname'
with mpirun:

mpirun -np 48 hostname

Finally, are you sure that the Open MPI you are using was
compiled with Torque support?
If not, I wonder whether options like '-bynode' would work at all.
Jeff may correct me if I am wrong, but if your
Open MPI lacks Torque support,
you may need to pass $PBS_NODEFILE to mpirun
as your hostfile.
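
[A quick way to check, as a sketch: ompi_info should list the 'tm'
components if Torque support was built in; if it is missing, you can
point mpirun at the node file by hand. I am reusing ./your_program
as a placeholder.]

# should show the tm (Torque) components if support was compiled in
ompi_info | grep tm

# fallback if Torque support is missing
mpirun -np 48 --hostfile $PBS_NODEFILE ./your_program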

I hope this helps,
Gus Correa

On 06/01/2012 11:26 AM, Edmund Sumbar wrote:
> On Fri, Jun 1, 2012 at 8:09 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> It's been a loooong time since I've run under PBS, so I don't
> remember if your script's environment is copied out to the remote
> nodes where your application actually runs.
>
> Can you verify that PATH and LD_LIBRARY_PATH are the same on all
> nodes in your PBS allocation after you module load?
>
>
> I compiled the following program and invoked it with "mpiexec -bynode
> ./test-env" in a PBS script.
>
> #include "mpi.h"
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
>
> int main (int argc, char *argv[])
> {
> int i, rank, size, namelen;
> MPI_Status stat;
>
> MPI_Init (&argc, &argv);
>
> MPI_Comm_size (MPI_COMM_WORLD, &size);
> MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>
> printf("rank: %d: ld_library_path: %s\n", rank,
> getenv("LD_LIBRARY_PATH"));
>
> MPI_Finalize ();
>
> return (0);
> }
>
> I submitted the script with "qsub -l procs=24 job.pbs", and got
>
> rank: 4: ld_library_path:
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64
>
> rank: 3: ld_library_path:
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/ipp/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mkl/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/compiler/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/tbb/lib/intel64:/home/esumbar/local/lib:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64/server:/usr/lib/jvm/jre-1.6.0-sun/lib/amd64:/opt/sgi/sgimc/lib:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/debugger/lib/intel64:/lustre/jasper/software/intel//l_ics_2012.0.032/composer_xe_2011_sp1.6.233/mpirt/lib/intel64
>
> ...more of the same...
>
> When I submitted it with -l procs=48, I got
>
> [cl2n004:11617] *** Process received signal ***
> [cl2n004:11617] Signal: Segmentation fault (11)
> [cl2n004:11617] Signal code: Address not mapped (1)
> [cl2n004:11617] Failing at address: 0x10
> [cl2n004:11617] [ 0] /lib64/libpthread.so.0 [0x376ca0ebe0]
> [cl2n004:11617] [ 1]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
> [0x2af788a98113]
> [cl2n004:11617] [ 2]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1(opal_memory_ptmalloc2_malloc+0x59)
> [0x2af788a9a8a9]
> [cl2n004:11617] [ 3]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/libmpi.so.1
> [0x2af788a9a596]
> [cl2n004:11617] [ 4]
> /lustre/jasper/software/openmpi/openmpi-1.6-intel/lib/openmpi/mca_btl_openib.so
> [0x2af78c916654]
> [cl2n004:11617] [ 5] /lib64/libpthread.so.0 [0x376ca0677d]
> [cl2n004:11617] [ 6] /lib64/libc.so.6(clone+0x6d) [0x376bed325d]
> [cl2n004:11617] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 4 with PID 11617 on node cl2n004
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> It seems that failures happen for arbitrary reasons. When I added a line
> in the PBS script to print out the node allocation, the procs=24 case
> failed, but then it worked a few seconds later, with the same list of
> allocated nodes. So there's definitely something amiss with the cluster,
> although I wouldn't know where to start investigating. Perhaps there is
> a pre-installed OMPI somewhere that's interfering, but I'm doubtful.
>
> By the way, thanks for all the support.
>
> --
> Edmund Sumbar
> University of Alberta
> +1 780 492 9360
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users