
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] programs are segfaulting using Torque & OpenMPI
From: W.Keegstra (W.Keegstra_at_[hidden])
Date: 2009-07-31 15:15:16


Hi Gus,

Your first suggestion did the trick; it is working now.

Thank you very much, and thanks also to Ralph for helping out.

Wilko

On Fri, 31 Jul 2009 14:00:05 -0400
  Gus Correa <gus_at_[hidden]> wrote:
> Hi Wilko, list
>
> Two wild guesses:
>
> 1) Check whether the pbs_mom daemon script on your nodes (in /etc/init.d
> on RHEL/CentOS/Fedora-type Linux) sets the system limits properly,
> in particular the stack size. Something like this:
>
> ulimit -n 32768
> ulimit -s unlimited
> ulimit -l unlimited
>
> We had problems with this in the past,
> with programs segfaulting for no apparent reason (most of the time
> the default stack size was too small).
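A minimal sketch of what that might look like near the top of the pbs_mom init script; the path /etc/init.d/pbs_mom and the values shown here are assumptions, adjust them for your distribution and site:

    # /etc/init.d/pbs_mom  (assumed location, RHEL/CentOS-style init script)
    # Raise the limits before pbs_mom is started, so every process it spawns
    # on the node (including MPI ranks launched via tm) inherits them.
    ulimit -n 32768          # open file descriptors
    ulimit -s unlimited      # stack size; a small default is a classic cause of segfaults
    ulimit -l unlimited      # max locked memory, needed by some interconnects
    # ... the usual start/stop logic of the init script follows here ...

After restarting pbs_mom on the nodes, the new limits only apply to jobs started after the restart.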
>
>
> 2) Make sure the Torque libtm you linked Open MPI to is the one that
> corresponds to your Torque 2.3.7, i.e.
> --with-tm=/full/path/to/torque-2.3.7/library/directory
>
> If you have more than one version of Torque installed on your system,
> using the full path will prevent picking the wrong version.
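As a sketch of such a build; the Torque install prefix below is an assumption, so point --with-tm at wherever your Torque 2.3.7 headers and libtorque actually live:

    ./configure --prefix=/usr/local \
        --with-tm=/usr/local/torque-2.3.7 \
        --enable-debug
    make all install
    # Afterwards, confirm that tm support really went in:
    ompi_info | grep tm      # should list the 'ras' and 'plm' tm components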
>
> My $0.02
> Gus Correa
>
>> On Fri, Jul 31, 2009 at 11:31 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> You might check with your sys admin - or check out the "ulimit" cmd.
>>> Depends on what the sys admin has set for system limits.
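One simple way to see what limits a Torque job actually gets is a throwaway job that just prints them; a sketch, reusing the queue name from the job script quoted further down:

    # check_limits.pbs -- minimal test job (file name is arbitrary)
    #PBS -q cluster2
    #PBS -l nodes=1
    #PBS -j oe
    ulimit -a                # prints stack size, open files, locked memory, core size, ...

    # submit it and read the job's output file:
    qsub check_limits.pbs

Compare the reported limits with what an interactive shell on the same node shows.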
>>>
>>>
>>> On Jul 31, 2009, at 9:12 AM, Wilko Keegstra wrote:
>>>
>>>> Hi,
>>>>
>>>> So far I don't have a core file.
>>>> The weird thing is that the same job runs well when Open MPI
>>>> is compiled without --with-tm.
>>>> Is the amount of memory, or the number of open files, different in the
>>>> two cases?
>>>> How can I force unlimited resources for the job?
>>>> Only then will I get a core file.
>>>>
>>>> kind regards,
>>>> Wilko
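One common way to get a core file out of a batch job is to raise the core-file size limit in the job script itself, right before mpiexec; a minimal sketch (the application arguments are abbreviated with '...'):

    #PBS -q cluster2
    #PBS -l nodes=5:ppn=2
    #PBS -j oe
    ulimit -c unlimited      # allow core dumps
    ulimit -s unlimited      # rule out a too-small stack at the same time
    cd $PBS_O_WORKDIR        # cores are written to the crashing process's cwd
    /usr/local/bin/mpiexec -machinefile $PBS_NODEFILE \
        /pcs/programs/grip/bin/RunAlignmentMPI DoAlign ...
    # Note: with the tm launcher the ranks on the other nodes inherit their
    # limits from pbs_mom there, so the pbs_mom init-script fix may still be needed.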
>>>>
>>>> Ralph Castain wrote:
>>>>> Ummm...this log indicates that OMPI ran perfectly - it is your
>>>>> application that segfaulted.
>>>>>
>>>>> Can you run gdb (or your favorite debugger) against a core file from
>>>>> your app? It looks like something in your app is crashing.
>>>>>
>>>>> As far as I can tell, everything is working fine. We launch and wire up
>>>>> just fine, then detect one of your processes has segfaulted - which
>>>>> triggers us to kill the remaining processes and terminate the job.
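A minimal sketch of inspecting such a core with gdb; the core file name depends on the kernel's core_pattern setting, so 'core.3150' is only an assumption:

    # on the node where the rank died (rugem21 in the error message quoted below)
    gdb /pcs/programs/grip/bin/RunAlignmentMPI core.3150
    (gdb) bt             # backtrace of the crash
    (gdb) info locals    # more useful if the app was compiled with -g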
>>>>>
>>>>>
>>>>> On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have recompiled openmpi with the --enable-debug and
>>>>>> --with-tm=/usr/local
>>>>>> flags, and submitted the job to Torque 2.3.7:
>>>>>>
>>>>>> #PBS -q cluster2
>>>>>> #PBS -l nodes=5:ppn=2
>>>>>> #PBS -N AlignImages
>>>>>> #PBS -j oe
>>>>>> /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 \
>>>>>>     -mca plm_base_verbose 5 --debug-daemons -machinefile $PBS_NODEFILE \
>>>>>>     /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
>>>>>>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
>>>>>>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000 \
>>>>>>     64.000 /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
>>>>>>
>>>>>> and the job crashed almost immediately. I have attached:
>>>>>> tm.3.gz, Job output: AlignImages.o34.gz, momlog-20090731
>>>>>>
>>>>>> I hope you can help me,
>>>>>> kind regards,
>>>>>> Wilko
>>>>>>
>>>>>>
>>>>>> Ralph Castain wrote:
>>>>>>> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
>>>>>>> allocation, and the man page for tm_spawn?
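Both are easy to capture; a sketch:

    # from inside any Torque 2.3.7 job:
    cat $PBS_NODEFILE > nodefile.txt
    # on a machine with the Torque man pages (torque-devel) installed:
    man tm_spawn | col -b > tm_spawn.txt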
>>>>>>>
>>>>>>> My only guess would be that something changed in those areas, as we
>>>>>>> don't really use anything else from Torque and run on Torque-based
>>>>>>> clusters in production every day. Not sure what version we have here,
>>>>>>> though I believe it is pretty current (will check).
>>>>>>>
>>>>>>> You also might want to configure OMPI 1.3.3 with --enable-debug. You
>>>>>>> could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
>>>>>>> --debug-daemons on your mpirun cmd line to get a step-by-step diagnostic
>>>>>>> output of the interaction with Torque. Should give us some idea of where
>>>>>>> the failure is occurring.
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>>>>>>>
>>>>>>>> hi,
>>>>>>>>
>>>>>>>> I have the following problem:
>>>>>>>>
>>>>>>>> I am using openmpi 1.3.3
>>>>>>>>
>>>>>>>> Programs (run directly or from scripts) submitted with mpiexec run
>>>>>>>> fine.
>>>>>>>>
>>>>>>>> Programs (run directly or from scripts) submitted through Torque 2.3.7,
>>>>>>>> with openmpi compiled with --with-tm (and torque-devel installed),
>>>>>>>> segfault.
>>>>>>>>
>>>>>>>> Programs submitted through Torque 2.3.7 with openmpi compiled without
>>>>>>>> --with-tm (and NO torque-devel installed) run fine; however, mpiexec
>>>>>>>> programs started from a script (script submitted through Torque) only
>>>>>>>> run on 1 node, so I need openmpi compiled with --with-tm.
>>>>>>>>
>>>>>>>> We also have a cluster running openmpi 1.2.9 compiled without
>>>>>>>> --with-tm in combination with Torque 2.3.3, and everything runs fine
>>>>>>>> there: NO segfaults, and mpiexec from a script also runs on the nodes
>>>>>>>> selected at submission time.
>>>>>>>>
>>>>>>>> I don't have errors in the log files, only in the job log file:
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>> mpiexec noticed that process rank 7 with PID 3150 on node
>>>>>>>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Could anyone please help me,
>>>>>>>> many thanks in advance
>>>>>>>> Wilko Keegstra
>>>>>>>>
>>>>>>>> --
>>>>>>>> +-------------------------------------------------------------+
>>>>>>>> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
>>>>>>>> | Groningen University email: W.Keegstra_at_[hidden] |
>>>>>>>> | Groningen Biomolecular Sciences and Biotechnology Institute |
>>>>>>>> | Nijenborgh 4 phone: +31503634224 |
>>>>>>>> | 9747 AG GRONINGEN fax : +31503634800 |
>>>>>>>> | The Netherlands |
>>>>>>>> +-------------------------------------------------------------+
>>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users