Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] programs are segfaulting using Torque & OpenMPI
From: Wilko Keegstra (w.keegstra_at_[hidden])
Date: 2009-07-31 11:12:29


Hi,

So far I don't have a core file.
The weird thing is that the same job runs fine when openmpi
is compiled without --with-tm.
Are the memory limits or the maximum number of open files different in the
two cases?
How can I force unlimited resources for the job?
Only then will I get a core file.

kind regards,
Wilko
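
One way to approach the core-file question above is to raise the limit inside each MPI process's own environment instead of in the job script. The sketch below is a hypothetical wrapper (the name core_wrap.sh and the function raise_core_limit are illustrative, not part of Open MPI or Torque). It assumes that, with a --with-tm build, remote processes are started through pbs_mom and may therefore inherit that daemon's limits and not those of a login shell, which would also be one possible reason the limits differ between the two builds.

```shell
#!/bin/sh
# core_wrap.sh (hypothetical name): raise the core-file limit, then run
# the real program given as arguments.

raise_core_limit() {
    # Try "unlimited" first; if the hard limit forbids that, fall back to
    # the highest value the hard limit allows.
    ulimit -S -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"
}

raise_core_limit

# Report the limits actually in effect on this node, for comparison
# between the two builds (goes to stderr, so it lands in the job log).
echo "$(uname -n): core=$(ulimit -S -c) nofile=$(ulimit -S -n)" >&2

# Replace this shell with the wrapped command, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

In the job script the wrapper would go between the mpiexec options and the program, for example `mpiexec -machinefile $PBS_NODEFILE /path/to/core_wrap.sh /pcs/programs/grip/bin/RunAlignmentMPI DoAlign ...`. A wrapper can only raise the soft limit up to the hard limit; if the reported core limit is still 0, the hard limit of the launching daemon would have to be raised by an administrator.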

Ralph Castain wrote:
> Ummm...this log indicates that OMPI ran perfectly - it is your
> application that segfaulted.
>
> Can you run gdb (or your favorite debugger) against a core file from
> your app? It looks like something in your app is crashing.
>
> As far as I can tell, everything is working fine. We launch and wireup
> just fine, then detect one of your processes has segfaulted - which
> triggers us to kill the remaining processes and terminate the job.
>
>
> On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:
>
>> Hi,
>>
>> I have recompiled openmpi with the --enable-debug and
>> --with-tm=/usr/local
>> flags, and submitted the job to Torque 2.3.7:
>>
>> #PBS -q cluster2
>> #PBS -l nodes=5:ppn=2
>> #PBS -N AlignImages
>> #PBS -j oe
>> /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 -mca plm_base_verbose
>> 5 --debug-daemons -machinefile $PBS_NODEFILE
>> /pcs/programs/grip/bin/RunAlignmentMPI DoAlign
>> /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img
>> /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img 4 9 14 1 2497 360.000
>> 64.000 /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
>>
>> and the job crashed almost immediately. I have attached:
>> tm.3.gz, the job output AlignImages.o34.gz, and momlog-20090731
>>
>> I hope you can help me,
>> kind regards,
>> Wilko
>>
>>
>> Ralph Castain wrote:
>>> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
>>> allocation, and the man page for tm_spawn?
>>>
>>> My only guess would be that something changed in those areas as we don't
>>> really use anything else from Torque, and run on Torque-based clusters
>>> in production every day. Not sure what version we have here, though I
>>> believe it is pretty current (will check).
>>>
>>> You also might want to configure OMPI 1.3.3 with --enable-debug. You
>>> could then do a run with -mca ras_base_verbose 5 -mca plm_base_verbose 5
>>> --debug-daemons on your mpirun cmd line to get a step-by-step diagnostic
>>> output of the interaction with Torque. Should give us some idea of where
>>> the failure is occurring.
>>>
>>> Ralph
>>>
>>> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>>>
>>>> hi,
>>>>
>>>> I have the following problem:
>>>>
>>>> I am using openmpi 1.3.3
>>>>
>>>> programs (directly and from scripts) submitted with mpiexec are running
>>>> fine.
>>>>
>>>> programs (directly and from scripts) submitted through Torque 2.3.7
>>>> with openmpi compiled with --with-tm (and torque-devel installed)
>>>> segfault.
>>>>
>>>> programs submitted through Torque 2.3.7 directly with openmpi
>>>> compiled without --with-tm (and NO torque-devel installed) run fine;
>>>> however, mpiexec programs started from a script (script submitted through
>>>> Torque) run on only 1 node, so I need openmpi compiled with --with-tm.
>>>>
>>>> We also have a cluster running with openmpi 1.2.9 compiled without
>>>> --with-tm in combination with torque 2.3.3 and everything is running
>>>> fine, so NO segfaults and mpiexec from script also running on the
>>>> nodes selected at submitting time.
>>>>
>>>> I don't see errors in the log files, only in the job log file:
>>>>
>>>> ---------------------------------------------------------------------------
>>>>
>>>>
>>>> mpiexec noticed that process rank 7 with PID 3150 on node
>>>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>> Could anyone please help me,
>>>> many thanks in advance
>>>> Wilko Keegstra
>>>>
>>>> --
>>>> +-------------------------------------------------------------+
>>>> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
>>>> | Groningen University email: W.Keegstra_at_[hidden] |
>>>> | Groningen Biomolecular Sciences and Biotechnology Institute |
>>>> | Nijenborgh 4 phone: +31503634224 |
>>>> | 9747 AG GRONINGEN fax : +31503634800 |
>>>> | The Netherlands |
>>>> +-------------------------------------------------------------+
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>> Attachments: tm.3.gz, AlignImages.o34.gz, momlog-20090731.gz
>

-- 
+-------------------------------------------------------------+
| Dr. Wilko Keegstra    priv.phone: +31594514153,+31610477915 |
| Groningen University       email: W.Keegstra_at_[hidden]         |
| Groningen Biomolecular Sciences and Biotechnology Institute |
| Nijenborgh 4               phone: +31503634224              |
| 9747 AG GRONINGEN          fax  : +31503634800              |
| The Netherlands                                             |
+-------------------------------------------------------------+