
Subject: Re: [OMPI users] programs are segfaulting using Torque & OpenMPI
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-31 10:49:37


Ummm...this log indicates that OMPI ran perfectly - it is your
application that segfaulted.

Can you run gdb (or your favorite debugger) against a core file from
your app? It looks like something in your app is crashing.
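
For example, assuming core dumps are enabled on the compute nodes (the
application path and core file name below are illustrative; core file
naming varies by system):

    ulimit -c unlimited        # in the job script, before mpiexec
    gdb /path/to/your_app core.<pid>
    (gdb) bt                   # backtrace showing where it crashed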

As far as I can tell, everything is working fine. We launch and wireup
just fine, then detect one of your processes has segfaulted - which
triggers us to kill the remaining processes and terminate the job.

On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:

> Hi,
>
> I have recompiled openmpi with the --enable-debug and
> --with-tm=/usr/local flags, and submitted the job to Torque 2.3.7:
>
> #PBS -q cluster2
> #PBS -l nodes=5:ppn=2
> #PBS -N AlignImages
> #PBS -j oe
> /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 \
>     -mca plm_base_verbose 5 --debug-daemons -machinefile $PBS_NODEFILE \
>     /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img \
>     4 9 14 1 2497 360.000 64.000 \
>     /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
>
> and the job crashed almost immediately. I have attached:
> tm.3.gz, Job output: AlignImages.o34.gz, momlog-20090731
>
> I hope you can help me,
> kind regards,
> Wilko
>
>
> Ralph Castain wrote:
>> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
>> allocation, and the man page for tm_spawn?
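>>
>> (For reference: with nodes=5:ppn=2, a PBS_NODEFILE normally lists
>> each allocated host once per slot; the hostnames below are
>> hypothetical:
>>
>> node01
>> node01
>> node02
>> node02
>> ...
>>
>> If I recall correctly, that file is also what our tm support reads
>> to discover the allocation.)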
>>
>> My only guess would be that something changed in those areas as we
>> don't
>> really use anything else from Torque, and run on Torque-based
>> clusters
>> in production every day. Not sure what version we have here, though I
>> believe it is pretty current (will check).
>>
>> You also might want to configure OMPI 1.3.3 with --enable-debug. You
>> could then do a run with -mca ras_base_verbose 5 -mca
>> plm_base_verbose 5
>> --debug-daemons on your mpirun cmd line to get a step-by-step
>> diagnostic
>> output of the interaction with Torque. Should give us some idea of
>> where
>> the failure is occurring.
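>>
>> Roughly like this (the install prefix and application name are
>> placeholders; keep your existing --with-tm setting):
>>
>> ./configure --enable-debug --with-tm=/usr/local --prefix=/usr/local
>> make all install
>> mpirun -mca ras_base_verbose 5 -mca plm_base_verbose 5 \
>>     --debug-daemons your_app <args>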
>>
>> Ralph
>>
>> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>>
>>> hi,
>>>
>>> I have the following problem:
>>>
>>> I am using openmpi 1.3.3
>>>
>>> Programs (run directly and from scripts) launched with mpiexec run
>>> fine.
>>>
>>> Programs (run directly and from scripts) submitted through Torque
>>> 2.3.7, with openmpi compiled with --with-tm (and torque-devel
>>> installed), segfault.
>>>
>>> Programs submitted through Torque 2.3.7 with openmpi compiled
>>> without --with-tm (and NO torque-devel installed) run fine; however,
>>> mpiexec programs started from a script (the script submitted through
>>> Torque) run on only 1 node, so I need openmpi compiled with
>>> --with-tm.
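>>>
>>> (Without tm support, mpiexec only knows about the node it starts on
>>> unless the allocation is passed by hand, e.g. something like
>>>
>>> mpiexec -machinefile $PBS_NODEFILE -np 10 ./RunAlignmentMPI <args>
>>>
>>> where -np 10 is illustrative for nodes=5:ppn=2; tm support makes
>>> this automatic.)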
>>>
>>> We also have a cluster running openmpi 1.2.9, compiled without
>>> --with-tm, in combination with Torque 2.3.3, and everything runs
>>> fine: NO segfaults, and mpiexec from a script also runs on the nodes
>>> selected at submission time.
>>>
>>> There are no errors in the log files, only in the job log file:
>>>
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 7 with PID 3150 on node
>>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>>
>>> Could anyone please help me,
>>> many thanks in advance
>>> Wilko Keegstra
>>>
>>> --
>>> +-------------------------------------------------------------+
>>> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
>>> | Groningen University email: W.Keegstra_at_[hidden] |
>>> | Groningen Biomolecular Sciences and Biotechnology Institute |
>>> | Nijenborgh 4 phone: +31503634224 |
>>> | 9747 AG GRONINGEN fax : +31503634800 |
>>> | The Netherlands |
>>> +-------------------------------------------------------------+
>>
>
> --
> +-------------------------------------------------------------+
> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
> | Groningen University email: W.Keegstra_at_[hidden] |
> | Groningen Biomolecular Sciences and Biotechnology Institute |
> | Nijenborgh 4 phone: +31503634224 |
> | 9747 AG GRONINGEN fax : +31503634800 |
> | The Netherlands |
> +-------------------------------------------------------------+
> <tm.3.gz> <AlignImages.o34.gz> <momlog-20090731.gz>