Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] programs are segfaulting using Torque & OpenMPI
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-31 11:31:55


You might check with your sys admin, or check out the "ulimit" cmd.
It depends on what the sys admin has set for system limits.
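
For example, here is a rough sketch of what you could put near the top of the job script, before the mpiexec line, to see which limits the job actually gets and to raise the core file limit as far as the system allows. The helper names report_limits and raise_core_limit are made up for illustration. This only affects the shell that runs it and the processes it starts; ranks started on other nodes may still get whatever limits are configured there.

```shell
#!/bin/sh
# Sketch only: report_limits and raise_core_limit are hypothetical helper
# names, not part of Torque or Open MPI.

# Print the limits the current shell is running under.
report_limits() {
    echo "core file size (soft): $(ulimit -S -c)"
    echo "core file size (hard): $(ulimit -H -c)"
    echo "open files (soft):     $(ulimit -S -n)"
    echo "virtual memory (soft): $(ulimit -S -v)"
}

# Raise the soft core-file limit up to the hard limit. An unprivileged
# process is allowed to do this; going above the hard limit needs the admin.
raise_core_limit() {
    ulimit -S -c "$(ulimit -H -c)"
}

report_limits
raise_core_limit
echo "core file size after raise: $(ulimit -S -c)"
```

If the core file size still reads 0 after this, the hard limit itself is 0 and only the sys admin can change it.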

On Jul 31, 2009, at 9:12 AM, Wilko Keegstra wrote:

> Hi,
>
> So far I don't have a core file.
> The weird thing is that the same job runs fine when openmpi
> is compiled without --with-tm.
> Is the amount of memory or the number of open files different in the
> two cases?
> How can I force unlimited resources for the job?
> Only then will I get a core file.
>
> kind regards,
> Wilko
>
> Ralph Castain wrote:
>> Ummm...this log indicates that OMPI ran perfectly - it is your
>> application that segfaulted.
>>
>> Can you run gdb (or your favorite debugger) against a core file from
>> your app? It looks like something in your app is crashing.
>>
>> As far as I can tell, everything is working fine. We launch and wireup
>> just fine, then detect one of your processes has segfaulted - which
>> triggers us to kill the remaining processes and terminate the job.
>>
>>
>> On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:
>>
>>> Hi,
>>>
>>> I have recompiled openmpi with the --enable-debug and
>>> --with-tm=/usr/local flags, and submitted the job to torque 2.3.7:
>>>
>>> #PBS -q cluster2
>>> #PBS -l nodes=5:ppn=2
>>> #PBS -N AlignImages
>>> #PBS -j oe
>>> /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 \
>>>     -mca plm_base_verbose 5 --debug-daemons -machinefile $PBS_NODEFILE \
>>>     /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
>>>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
>>>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000 \
>>>     64.000 /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
>>>
>>> The job crashed almost immediately. I have attached:
>>> tm.3.gz, the job output AlignImages.o34.gz, and momlog-20090731
>>>
>>> I hope you can help me,
>>> kind regards,
>>> Wilko
>>>
>>>
>>> Ralph Castain wrote:
>>>> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
>>>> allocation, and the man page for tm_spawn?
>>>>
>>>> My only guess would be that something changed in those areas as we
>>>> don't really use anything else from Torque, and run on Torque-based
>>>> clusters in production every day. Not sure what version we have here,
>>>> though I believe it is pretty current (will check).
>>>>
>>>> You also might want to configure OMPI 1.3.3 with --enable-debug. You
>>>> could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
>>>> --debug-daemons on your mpirun cmd line to get a step-by-step diagnostic
>>>> output of the interaction with Torque. Should give us some idea of where
>>>> the failure is occurring.
>>>>
>>>> Ralph
>>>>
>>>> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>>>>
>>>>> hi,
>>>>>
>>>>> I have the following problem:
>>>>>
>>>>> I am using openmpi 1.3.3.
>>>>>
>>>>> Programs (run directly and from scripts) submitted with mpiexec run
>>>>> fine.
>>>>>
>>>>> Programs (run directly and from scripts) submitted through Torque
>>>>> 2.3.7, with openmpi compiled with --with-tm (and torque-devel
>>>>> installed), segfault.
>>>>>
>>>>> Programs submitted directly through Torque 2.3.7, with openmpi
>>>>> compiled without --with-tm (and NO torque-devel installed), run fine.
>>>>> However, mpiexec programs started from a script (script submitted
>>>>> through torque) run on only 1 node, so I need openmpi compiled with
>>>>> --with-tm.
>>>>>
>>>>> We also have a cluster running openmpi 1.2.9 compiled without
>>>>> --with-tm in combination with torque 2.3.3, and everything is running
>>>>> fine there: NO segfaults, and mpiexec from a script also runs on the
>>>>> nodes selected at submission time.
>>>>>
>>>>> I don't have errors in the log files, only in the job log file:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec noticed that process rank 7 with PID 3150 on node
>>>>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Could anyone please help me,
>>>>> many thanks in advance
>>>>> Wilko Keegstra
>>>>>
>>>>> --
>>>>> +-------------------------------------------------------------+
>>>>> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
>>>>> | Groningen University email: W.Keegstra_at_[hidden] |
>>>>> | Groningen Biomolecular Sciences and Biotechnology Institute |
>>>>> | Nijenborgh 4 phone: +31503634224 |
>>>>> | 9747 AG GRONINGEN fax : +31503634800 |
>>>>> | The Netherlands |
>>>>> +-------------------------------------------------------------+
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> <tm.3.gz> <AlignImages.o34.gz> <momlog-20090731.gz>