Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] programs are segfaulting using Torque & OpenMPI
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-31 13:43:03


If you are launching without Torque, then you will be launching with
rsh or ssh. So yes, there will be some differences in the environment.
For example, launching via ssh means that you pick up your
remote .cshrc (or whatever shell flavor you use), while you don't when
launching via Torque.
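
If you want to see the difference for yourself, comparing the
environment each launcher hands your processes is a quick check. Just a
sketch - the hostnames below are placeholders for two of your nodes:

   # interactive run, launched over ssh
   mpiexec -np 2 --host node01,node02 env | sort > env.ssh
   # the same command inside a Torque job script (tm launcher)
   mpiexec -np 2 env | sort > env.tm
   diff env.ssh env.tm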

You can use 1.3.3 without Torque - it will launch faster than 1.2.9,
but still use rsh/ssh to do it. Just build it --without-tm, or at
least specify -mca plm rsh on your cmd line.
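
For reference, either of these would do it (the prefix path is just a
placeholder - use whatever you normally configure with):

   ./configure --prefix=/usr/local --without-tm
   make all install

or, keeping the tm support built in but forcing the rsh launcher for a
single run:

   mpiexec -mca plm rsh -np 10 -machinefile $PBS_NODEFILE ./your_app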

As far as anyone can tell, nothing much changed between Torque 2.3.6
(which is what we run here every day) and 2.3.7 that could cause a
problem. Your debug info clearly shows the launch completing just
fine, and the app segfaulting.

Given your description of random segfaults, it sounds to me like you
have a memory corruption issue in your code. The Torque question could
just be a red herring having more to do with where in memory you are
overwriting things - could be you hit something sensitive a little
faster when running under Torque.

Anyway, if you want, launching under ssh is fine and may help you
debug your program.
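
If valgrind (or a similar memory checker) happens to be installed on
your nodes, running the ranks under it is often the fastest way to
catch corruption. A rough sketch - substitute your own binary and
arguments:

   mpiexec -np 10 valgrind --log-file=vg.%p.log \
       /pcs/programs/grip/bin/RunAlignmentMPI <your args here>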

On Jul 31, 2009, at 11:30 AM, W.Keegstra wrote:

> Hi,
>
> Sorry to bother you again, but the system limits are the default
> limits for openSuSE 11.1, which as far as I can see are the same
> as in openSuSE 10.0.
> Furthermore, if I specify the node parameter such that the job runs
> on only 1 node (either with 2 or 8 cores), it runs well.
> A few weeks ago I put a lot of printf statements in the program
> that is executed, and the segfaults are completely random throughout
> the code, and also completely random across the nodes where the
> processes are running.
> Every time tm is used, because the process is spread over different
> nodes, it goes wrong; it works when tm is not used, either because
> the process runs on 1 node or because openmpi is compiled without tm.
> The only thing I can think of is that the environment (??) with tm
> is different from that without tm.
> Maybe the fact that the program uses MPI file open, read and write
> has something to do with it. I have tried for weeks to figure out
> what the problem is and I still have no idea. In my opinion
> everything points to the tm interface.
> For me the only solution for now is to stick with the older versions,
> i.e. torque 2.3.3 and openmpi 1.2.9; in that case tm is not needed
> and everything runs without problems, only 10-15% slower.
>
> kind regards,
> Wilko
>
> On Fri, 31 Jul 2009 09:31:55 -0600
> Ralph Castain <rhc_at_[hidden]> wrote:
>> You might check with your sys admin - or check out the "ulimit" cmd.
>> It depends on what the sys admin has set for the system limits.
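>>
>> For example (assuming a bash-style shell; hard limits set by the
>> admin may still override this), in the job script before the mpiexec
>> line:
>>
>>    ulimit -c unlimited    # allow core dumps for the job's processes
>>    ulimit -a              # show the limits actually in effect on the node
>>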
>> On Jul 31, 2009, at 9:12 AM, Wilko Keegstra wrote:
>>> Hi,
>>>
>>> So far I don't have a core file.
>>> The weird thing is that the same job runs well when openmpi
>>> is compiled without --with-tm.
>>> Is the amount of memory, or the number of open files, different in
>>> the two cases?
>>> How can I force unlimited resources for the job?
>>> Only then will I get a core file.
>>>
>>> kind regards,
>>> Wilko
>>>
>>> Ralph Castain wrote:
>>>> Ummm...this log indicates that OMPI ran perfectly - it is your
>>>> application that segfaulted.
>>>>
>>>> Can you run gdb (or your favorite debugger) against a core file
>>>> from
>>>> your app? It looks like something in your app is crashing.
>>>>
>>>> As far as I can tell, everything is working fine. We launch and
>>>> wireup
>>>> just fine, then detect one of your processes has segfaulted - which
>>>> triggers us to kill the remaining processes and terminate the job.
>>>>
>>>>
>>>> On Jul 31, 2009, at 8:35 AM, Wilko Keegstra wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have recompiled openmpi with the --enable-debug and
>>>>> --with-tm=/usr/local
>>>>> flags, and submitted the job to torque 2.3.7:
>>>>>
>>>>> #PBS -q cluster2
>>>>> #PBS -l nodes=5:ppn=2
>>>>> #PBS -N AlignImages
>>>>> #PBS -j oe
>>>>> /usr/local/bin/mpiexec -v -mca ras_base_verbose 5 \
>>>>>     -mca plm_base_verbose 5 --debug-daemons \
>>>>>     -machinefile $PBS_NODEFILE \
>>>>>     /pcs/programs/grip/bin/RunAlignmentMPI DoAlign \
>>>>>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml.img \
>>>>>     /pcs/pc00/keegstra/work/hm/hemo-mix-psml-ali.img \
>>>>>     4 9 14 1 2497 360.000 64.000 \
>>>>>     /pcs/pc00/keegstra/work/hm/hemo-mix-pref.img 1 7 0
>>>>>
>>>>> and the job crashed almost immediately. I have attached:
>>>>> tm.3.gz, the job output AlignImages.o34.gz, and momlog-20090731.gz
>>>>>
>>>>> I hope you can help me,
>>>>> kind regards,
>>>>> Wilko
>>>>>
>>>>>
>>>>> Ralph Castain wrote:
>>>>>> Could you send the contents of a PBS_NODEFILE from a Torque 2.3.7
>>>>>> allocation, and the man page for tm_spawn?
>>>>>>
>>>>>> My only guess would be that something changed in those areas
>>>>>> as we don't
>>>>>> really use anything else from Torque, and run on Torque-based
>>>>>> clusters
>>>>>> in production every day. Not sure what version we have here,
>>>>>> though I
>>>>>> believe it is pretty current (will check).
>>>>>>
>>>>>> You also might want to configure OMPI 1.3.3 with --enable-
>>>>>> debug. You
>>>>>> could then do a run with -mca ras_base_verbose 5 -mca
>>>>>> plm_base_verbose 5
>>>>>> --debug-daemons on your mpirun cmd line to get a step-by-step
>>>>>> diagnostic
>>>>>> output of the interaction with Torque. Should give us some
>>>>>> idea of where
>>>>>> the failure is occurring.
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>> On Jul 31, 2009, at 7:20 AM, Wilko Keegstra wrote:
>>>>>>
>>>>>>> hi,
>>>>>>>
>>>>>>> I have the following problem:
>>>>>>>
>>>>>>> I am using openmpi 1.3.3
>>>>>>>
>>>>>>> programs (directly and from scripts) submitted with mpiexec
>>>>>>> are running
>>>>>>> fine.
>>>>>>>
>>>>>>> programs (directly and from scripts) submitted through Torque
>>>>>>> 2.3.7, with openmpi compiled with --with-tm (and torque-devel
>>>>>>> installed), segfault.
>>>>>>>
>>>>>>> programs submitted through Torque 2.3.7 directly, with openmpi
>>>>>>> compiled without --with-tm (and NO torque-devel installed),
>>>>>>> run fine; however, mpiexec programs started from a script
>>>>>>> (script submitted through torque) only run on 1 node, so I
>>>>>>> need openmpi compiled with --with-tm
>>>>>>>
>>>>>>> We also have a cluster running openmpi 1.2.9 compiled without
>>>>>>> --with-tm in combination with torque 2.3.3, and everything runs
>>>>>>> fine: NO segfaults, and mpiexec from a script also runs on the
>>>>>>> nodes selected at submission time.
>>>>>>>
>>>>>>> I don't have errors in the log files, only in the job output file:
>>>>>>>
>>>>>>> ---------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> mpiexec noticed that process rank 7 with PID 3150 on node
>>>>>>> rugem21.chem.rug.nl exited on signal 11 (Segmentation fault).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Could anyone please help me,
>>>>>>> many thanks in advance
>>>>>>> Wilko Keegstra
>>>>>>>
>>>>>>> --
>>>>>>> +-------------------------------------------------------------+
>>>>>>> | Dr. Wilko Keegstra priv.phone: +31594514153,+31610477915 |
>>>>>>> | Groningen University email: W.Keegstra_at_[hidden] |
>>>>>>> | Groningen Biomolecular Sciences and Biotechnology Institute |
>>>>>>> | Nijenborgh 4 phone: +31503634224 |
>>>>>>> | 9747 AG GRONINGEN fax : +31503634800 |
>>>>>>> | The Netherlands |
>>>>>>> +-------------------------------------------------------------+
>>>>>
>>>>> <tm.3.gz> <AlignImages.o34.gz> <momlog-20090731.gz>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users