
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] programs are segfaulting using Torque & OpenMPI
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-31 09:28:42


Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
allocation, and the man page for tm_spawn?
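
One way to capture the node file from inside a Torque job is sketched below. It assumes only that Torque sets PBS_NODEFILE to the path of a file listing one host name per allocated slot; the function name summarize_nodefile is made up for this example.

```shell
#!/bin/sh
# Sketch: dump the Torque node file and count slots per host.
# summarize_nodefile is a hypothetical helper name, not part of Torque or Open MPI.
summarize_nodefile() {
    # Use the path given as $1, or fall back to the PBS_NODEFILE set by Torque.
    nodefile="${1:-$PBS_NODEFILE}"
    echo "== contents of $nodefile =="
    cat "$nodefile"
    echo "== slots per host =="
    # Each line is one slot; count how many times each host name appears.
    sort "$nodefile" | awk '{ n[$1]++ } END { for (h in n) print h, n[h] }' | sort
}

# Inside a job script PBS_NODEFILE is set, so print the summary right away.
if [ -n "${PBS_NODEFILE:-}" ]; then
    summarize_nodefile
fi
```

Putting this at the top of the submitted script sends the listing to the job's output file, which can then be mailed to the list.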

My only guess would be that something changed in those areas as we
don't really use anything else from Torque, and run on Torque-based
clusters in production every day. Not sure what version we have here,
though I believe it is pretty current (will check).

You also might want to configure OMPI 1.3.3 with --enable-debug. You
could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose
5 --debug-daemons on your mpirun cmd line to get a step-by-step
diagnostic output of the interaction with Torque. Should give us some
idea of where the failure is occurring.
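
As a rough sketch, the rebuild and the diagnostic run could look like the following. The wrapper name run_with_diagnostics, the DRY_RUN switch, the program name, and the log file name are placeholders invented for this example; the configure and mpirun options are the ones named above, and any other configure options your site needs are omitted.

```shell
#!/bin/sh
# Rebuild step, run in the Open MPI 1.3.3 source tree (other site-specific
# configure options are omitted here):
#   ./configure --with-tm --enable-debug && make all install

# Hypothetical wrapper: prepend the verbose/debug options to an mpirun call.
run_with_diagnostics() {
    set -- mpirun -mca ras_base_verbose 5 -mca plm_base_verbose 5 \
        --debug-daemons "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        # DRY_RUN is an assumption of this sketch: print the command only.
        echo "$*"
    else
        "$@"
    fi
}

# Example use inside the Torque job script (program name is a placeholder):
#   run_with_diagnostics ./my_mpi_program > ompi_debug.log 2>&1
```

Redirecting both stdout and stderr keeps the daemon output together with the program output in one file.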

Ralph

On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:

> hi,
>
> I have the following problem:
>
> I am using openmpi 1.3.3
>
> Programs (run directly or from scripts) submitted with mpiexec run
> fine.
>
> Programs (run directly or from scripts) submitted through Torque 2.3.7,
> with openmpi compiled with --with-tm (and torque-devel installed),
> segfault.
>
> Programs submitted directly through Torque 2.3.7, with openmpi
> compiled without --with-tm (and NO torque-devel installed), run fine.
> However, programs started with mpiexec from a script (script submitted
> through Torque) run on only 1 node, so I need openmpi compiled with
> --with-tm.
>
> We also have a cluster running openmpi 1.2.9 compiled without
> --with-tm in combination with Torque 2.3.3, and everything runs
> fine there: NO segfaults, and mpiexec from a script also runs on the
> nodes selected at submission time.
>
> There are no errors in the log files, only this in the job log file:
>
> ---------------------------------------------------------------------------
> mpiexec noticed that process rank 7 with PID 3150 on node
> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Could anyone please help me,
> many thanks in advance
> Wilko Keegstra
>
> --
> +-------------------------------------------------------------+
> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
> | Groningen University email: W.Keegstra_at_[hidden] |
> | Groningen Biomolecular Sciences and Biotechnology Institute |
> | Nijenborgh 4 phone: +31503634224 |
> | 9747 AG GRONINGEN fax : +31503634800 |
> | The Netherlands |
> +-------------------------------------------------------------+