Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI causing WRF to crash
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-08-04 19:18:36


Signal 15 is usually SIGTERM on Linux, meaning that some external entity probably killed the job.

The OMPI error message you describe is also typical for that kind of scenario -- i.e., a process exiting without calling MPI_Finalize could mean that it called exit() or that some external process killed it.
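
If it helps to see what "killed with signal 15" looks like from the parent's side, here is a minimal sketch in Python (not Open MPI code, and it assumes a POSIX system such as Linux). The helper names describe_exit and run_and_terminate are made up for illustration. It starts a sleeping child process, sends it SIGTERM the way an external entity might, and decodes the resulting exit status.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Describe a subprocess return code (hypothetical helper for this sketch)."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        sig = signal.Signals(-returncode)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited with status {returncode}"


def run_and_terminate():
    """Start a sleeping child, kill it externally, and return its return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.terminate()  # sends SIGTERM on POSIX
    child.wait()
    return child.returncode


if __name__ == "__main__":
    print("SIGTERM is signal", int(signal.SIGTERM))
    print("child", describe_exit(run_and_terminate()))
```

The child here never gets a chance to clean up, which is the same situation as an MPI rank that is killed before it reaches MPI_Finalize.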

On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote:

> I am trying to run a rather heavy wrf simulation with spectral nudging but the simulation crashes after 1.8 minutes of integration.
> The simulation has two domains with d01 = 601x601 and d02 = 721x721 and 51 vertical levels. I tried this simulation on two different systems, but the result was more or less the same. For example:
>
> On our Blue Gene/P with SUSE Linux Enterprise Server 10 ppc and the XLF compiler, I tried to run wrf on 2048 shared memory nodes (1 compute node = 4 cores, 32-bit, 850 MHz). For the parallel run I used mpixlc, mpixlcxx and mpixlf90. I got the following error message in the wrf.err file:
>
> <Aug 01 19:50:21.244540> BE_MPI (ERROR): The error message in the job
> record is as follows:
> <Aug 01 19:50:21.244657> BE_MPI (ERROR): "killed with signal 15"
>
> I also tried to run the same simulation on our Linux cluster (Linux Red Hat Enterprise 5.4m x86_64 and Intel compiler) with 8, 16 and 64 nodes (1 compute node = 8 cores). For the parallel run I used mpi/openmpi/1.4.2-intel-11. I got the following error message in the error log after a couple of minutes of integration.
>
> "mpirun has exited due to process rank 45 with PID 19540 on
> node ci118 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here)."
>
> I tried many things, but nothing seems to work. However, if I reduce the grid points below 200, the simulation runs fine. It appears that OpenMP probably has a problem with a large number of grid points, but I have no idea how to fix it. I would greatly appreciate it if you could suggest a solution.
>
> Best regards,
> ---
> Basit A. Khan, Ph.D.
> Postdoctoral Fellow
> Division of Physical Sciences & Engineering
> Office# 3204, Level 3, Building 1,
> King Abdullah University of Science & Technology
> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
> Kingdom of Saudi Arabia.
>
> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
> E-mail: basitali.khan_at_[hidden]
> Skype name: basit.a.khan
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/