Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI causing WRF to crash
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-08-06 09:25:59


Do you have something like valgrind on your machine? If so, why not launch your app under valgrind, e.g., "mpirun .... valgrind my_app"?
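
A minimal sketch of how such a command line might be assembled, assuming 4 ranks and one valgrind log per process. The helper name, the rank count and the log pattern are illustrative only and not taken from your setup; `--log-file` with `%p` (expanded to the PID) is a standard valgrind option.

```python
import shlex


def valgrind_mpirun_cmd(app, nprocs, app_args=(), log_pattern="vg.%p.log"):
    """Build an mpirun command line that runs every rank under valgrind.

    All names and defaults here are hypothetical placeholders.
    """
    return [
        "mpirun", "-np", str(nprocs),
        # valgrind replaces %p with the process ID, giving one log per rank
        "valgrind", f"--log-file={log_pattern}",
        app, *app_args,
    ]


# Print the command so it can be pasted into a job script.
print(shlex.join(valgrind_mpirun_cmd("./my_app", 4)))
```

Expect the run to be much slower under valgrind, so a smaller test case that still crashes is probably the better target.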

If your app is segfaulting, there isn't much OMPI can do to tell you why. All we can do is tell you that your app was hit with a SIGTERM.
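
If you want to see for yourself how a process ended, a small wrapper can at least report whether it exited normally or was killed by a signal. This is only a sketch: `describe_exit` is a made-up helper, and the child below is a stand-in that sends itself SIGTERM, not your actual app.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable reason."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        number = -returncode
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {number} ({name})"
    return f"exited with status {returncode}"


# Stand-in child process that terminates itself with SIGTERM.
child = [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGTERM)"]
proc = subprocess.run(child)
print(describe_exit(proc.returncode))
```

On Linux this should report signal 15 (SIGTERM). It tells you which signal arrived, not who sent it or why.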

Did you talk to your sys admin? Like Jeff said, that probably means you hit some system-imposed limit and the resource manager killed you.
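
Before asking, it may help to record the per-process limits your job actually sees. The sketch below uses Python's standard `resource` module; which limits to list is my assumption, and limits enforced by the resource manager itself (wall-clock time, per-job memory) will not show up here.

```python
import resource

# Limits that large-memory jobs commonly run into. Whether any of them
# is the culprit in this case is an assumption, not a diagnosis.
LIMITS = {
    "stack": resource.RLIMIT_STACK,
    "address space": resource.RLIMIT_AS,
    "data segment": resource.RLIMIT_DATA,
    "core file size": resource.RLIMIT_CORE,
}


def fmt(value):
    """Render a limit value (bytes) or 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} as seen by this process."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


for name, (soft, hard) in current_limits().items():
    print(f"{name:>15}: soft={fmt(soft)} hard={fmt(hard)}")
```

Run it through mpirun on the compute nodes rather than on the login node, since the two can have different limits.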

On Aug 5, 2011, at 11:55 PM, BasitAli Khan wrote:

> Hi David,
> Unfortunately there is no information about the error in the rsl.out.*,
> rsl.error and wrf.out files. The error message mentioned in the previous
> email appeared in the wrf.err file. Both rsl.out and rsl.error show
> integration stopping at the time of the crash, and that is it. I am just
> wondering if there is a way to monitor processes and find out the reason
> when some process dies.
>
> Cheers,
> ---
>
> Basit A. Khan, Ph.D.
> Postdoctoral Fellow
> Division of Physical Sciences & Engineering
> Office# 3204, Level 3, Building 1,
> King Abdullah University of Science & Technology
> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
> Kingdom of Saudi Arabia.
>
> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
> E-mail: basitali.khan_at_[hidden]
> Skype name: basit.a.khan
>
>
>
>
> On 8/5/11 8:43 PM, "David Warren" <warren_at_[hidden]> wrote:
>
>> That error is from one of the processes that was working when another
>> one died. It is not an indication that MPI had problems, but that you
>> had one of the wrf processes (#45) crash. You need to look at what
>> happened to process 45. What do the rsl.out and rsl.error files for #45
>> say?
>>
>> On 08/04/11 16:18, Jeff Squyres wrote:
>>> Signal 15 is usually SIGTERM on Linux, meaning that some external
>>> entity probably killed the job.
>>>
>>> The OMPI error message you describe is also typical for that kind of
>>> scenario -- i.e., a process exiting without calling MPI_Finalize could
>>> mean that it called exit() or that some external process killed it.
>>>
>>>
>>> On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote:
>>>
>>>
>>>> I am trying to run a rather heavy wrf simulation with spectral nudging
>>>> but the simulation crashes after 1.8 minutes of integration.
>>>> The simulation has two domains with d01 = 601x601 and d02 =
>>>> 721x721 and 51 vertical levels. I tried this simulation on two
>>>> different systems but the result was more or less the same. For example:
>>>>
>>>> On our Blue Gene/P with SUSE Linux Enterprise Server 10 ppc and the XLF
>>>> compiler I tried to run wrf on 2048 shared memory nodes (1 compute node
>>>> = 4 cores, 32 bit, 850 MHz). For the parallel run I used mpixlc,
>>>> mpixlcxx and mpixlf90. I got the following error message in the
>>>> wrf.err file:
>>>>
>>>> <Aug 01 19:50:21.244540> BE_MPI (ERROR): The error message in the job
>>>> record is as follows:
>>>> <Aug 01 19:50:21.244657> BE_MPI (ERROR): "killed with signal 15"
>>>>
>>>> I also tried to run the same simulation on our linux cluster (Linux
>>>> Red Hat Enterprise 5.4m x86_64 and Intel compiler) with 8, 16 and 64
>>>> nodes (1 compute node = 8 cores). For the parallel run I used
>>>> mpi/openmpi/1.4.2-intel-11. I got the following error message in the
>>>> error log after a couple of minutes of integration:
>>>>
>>>> "mpirun has exited due to process rank 45 with PID 19540 on
>>>> node ci118 exiting without calling "finalize". This may
>>>> have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here)."
>>>>
>>>> I tried many things but nothing seems to work. However, if I
>>>> reduce grid points below 200, the simulation runs fine. It appears
>>>> that OpenMP probably has a problem with a large number of grid points,
>>>> but I have no idea how to fix it. I would greatly appreciate it if you
>>>> could suggest a solution.
>>>>
>>>> Best regards,
>>>> ---
>>>> Basit A. Khan, Ph.D.
>>>> Postdoctoral Fellow
>>>> Division of Physical Sciences & Engineering
>>>> Office# 3204, Level 3, Building 1,
>>>> King Abdullah University of Science & Technology
>>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
>>>> Kingdom of Saudi Arabia.
>>>>
>>>> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
>>>> E-mail: basitali.khan_at_[hidden]
>>>> Skype name: basit.a.khan
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>>