Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] programs are segfaulting using Torque & OpenMPI
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-31 10:49:37


Ummm...this log indicates that OMPI ran perfectly - it is your
application that segfaulted.

Can you run gdb (or your favorite debugger) against a core file from
your app? It looks like something in your app is crashing.
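
A minimal sketch of that core-file check, assuming core files land in the job's working directory as "core" or "core.<pid>" (the real name and location depend on the system's core pattern). The helper name latest_core is hypothetical; the binary path is the one from the job script quoted below. Core-size limits on remote nodes may be governed by the Torque daemon there rather than by the job script, so the ulimit line may not be enough on its own.

```shell
# Print the most recently modified core file in a directory (default: current dir).
# Hypothetical helper; assumes core files are named "core" or "core.<pid>".
latest_core() {
    dir="${1:-.}"
    ls -t "$dir"/core* 2>/dev/null | head -n 1
}

# In the Torque job script, before the mpiexec line, allow core dumps:
#   ulimit -c unlimited
#
# After the crash, on the node that reported the segfault:
#   gdb /pcs/programs/grip/bin/RunAlignmentMPI "$(latest_core .)"
#   (gdb) bt        # print the stack at the point of the crash
```

A backtrace is most useful when the application was built with debugging symbols (-g).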

As far as I can tell, everything is working fine. We launch and wire up
just fine, then detect that one of your processes has segfaulted - which
triggers us to kill the remaining processes and terminate the job.
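
To find which rank and node to look at, the segfault notice in the job output can be picked apart with a small filter. This is only a sketch: the function name parse_segfault_notice is hypothetical, and it assumes the notice has the wording quoted further down in this thread and sits on a single line of the log.

```shell
# Extract rank, PID, node, and signal from mpiexec's segfault notice.
# Assumption: the notice is on one line; join wrapped lines first if it is not.
parse_segfault_notice() {
    sed -n 's/.*process rank \([0-9][0-9]*\) with PID \([0-9][0-9]*\) on node \([^ ]*\) exited on signal \([0-9][0-9]*\).*/rank=\1 pid=\2 node=\3 signal=\4/p'
}

# Usage (file name taken from the attachment list in this thread):
#   gunzip -c AlignImages.o34.gz | parse_segfault_notice
```

The node name it prints is where a core file from the crashed process would be expected.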

On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:

> Hi,
>
> I have recompiled openmpi with the --enable-debug and --with-tm=/usr/local
> flags, and submitted the job to torque 2.3.7:
>
> #PBS -q cluster2
> #PBS -l nodes=5:ppn=2
> #PBS -N AlignImages
> #PBS -j oe
> /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose 5 \
>     --debug-daemons -machinefile $PBS_NODEFILE \
>     /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000 64.000 \
>     /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
>
> and the job crashed almost immediately. I have attached:
> tm.3.gz, the job output (AlignImages.o34.gz), and momlog-20090731.
>
> I hope you can help me,
> kind regards,
> Wilko
>
>
> Ralph Castain wrote:
>> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
>> allocation, and the man page for tm_spawn?
>>
>> My only guess would be that something changed in those areas, as we don't
>> really use anything else from Torque, and run on Torque-based clusters
>> in production every day. Not sure what version we have here, though I
>> believe it is pretty current (will check).
>>
>> You also might want to configure OMPI 1.3.3 with --enable-debug. You
>> could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
>> --debug-daemons on your mpirun cmd line to get a step-by-step diagnostic
>> output of the interaction with Torque. Should give us some idea of where
>> the failure is occurring.
>>
>> Ralph
>>
>> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>>
>>> hi,
>>>
>>> I have the following problem:
>>>
>>> I am using openmpi 1.3.3
>>>
>>> Programs (directly and from scripts) submitted with mpiexec run fine.
>>>
>>> Programs (directly and from scripts) submitted through Torque 2.3.7,
>>> with openmpi compiled with --with-tm (and torque-devel installed),
>>> segfault.
>>>
>>> Programs submitted directly through Torque 2.3.7, with openmpi
>>> compiled without --with-tm (and NO torque-devel installed), run fine.
>>> However, mpiexec programs started from a script (script submitted
>>> through Torque) run on only 1 node, so I need openmpi compiled with
>>> --with-tm.
>>>
>>> We also have a cluster running openmpi 1.2.9 compiled without
>>> --with-tm in combination with torque 2.3.3, and everything runs fine
>>> there: NO segfaults, and mpiexec from a script also runs on the
>>> nodes selected at submission time.
>>>
>>> I don't see errors in any log files, only in the job log file:
>>>
>>> ---------------------------------------------------------------------------
>>>
>>> mpiexec noticed that process rank 7 with PID 3150 on node
>>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>>
>>> Could anyone please help me,
>>> many thanks in advance
>>> Wilko Keegstra
>>>
>>> --
>>> +-------------------------------------------------------------+
>>> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
>>> | Groningen University email: W.Keegstra_at_[hidden] |
>>> | Groningen Biomolecular Sciences and Biotechnology Institute |
>>> | Nijenborgh 4 phone: +31503634224 |
>>> | 9747 AG GRONINGEN fax : +31503634800 |
>>> | The Netherlands |
>>> +-------------------------------------------------------------+
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> <tm.3.gz> <AlignImages.o34.gz> <momlog-20090731.gz>